Rusty Russell, IBM Corporation
-26-July-2010
+14-September-2010
Abstract
argument. Additional arguments to open would require the
introduction of a tdb_open_ex2 call etc.
-2.1.1 Proposed Solution
+2.1.1 Proposed Solution<attributes>
tdb_open() will take a linked-list of attributes:
The aim is that building tdb with -DTDB_PTHREAD will result in a
pthread-safe version of the library, and otherwise no overhead
-will exist.
+will exist. Alternatively, a hooking mechanism similar to that
+proposed for [Proposed-Solution-locking-hook] could be used to
+enable pthread locking at runtime.
2.8 *_nonblock Functions And *_mark Functions Expose
Implementation
Remove TDB_CLEAR_IF_FIRST. Other workarounds are possible, but
see [TDB_CLEAR_IF_FIRST-Imposes-Performance].
+2.15 Extending The Header Is Difficult
+
+We have reserved (zeroed) words in the TDB header, which can be
+used for future features. If the future features are compulsory,
+the version number must be updated to prevent old code from
+accessing the database. But if the future feature is optional, we
+have no way of telling if older code is accessing the database or
+not.
+
+2.15.1 Proposed Solution
+
+The header should contain a “format variant” value (64-bit). This
+is divided into two 32-bit parts:
+
+1. The lower part reflects the format variant understood by code
+ accessing the database.
+
+2. The upper part reflects the format variant you must understand
+ to write to the database (otherwise you can only open for
+ reading).
+
+The latter field can only be written at creation time; the former
+should be written under the OPEN_LOCK when opening the database
+for writing, if the variant of the code is lower than the current
+lowest variant.
+
+This should allow backwards-compatible features to be added, and
+let us detect whether older code (which doesn't understand the
+feature) has written to the database.
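A minimal sketch of how the two halves might be read and updated, assuming an illustrative packing (the helper names and field layout are ours, not the actual TDB header definition):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative packing: lower 32 bits = lowest variant understood by
 * any opener so far; upper 32 bits = variant required to write (set
 * once at creation time). */
static uint32_t read_variant(uint64_t field)  { return (uint32_t)field; }
static uint32_t write_variant(uint64_t field) { return (uint32_t)(field >> 32); }

/* An opener may write only if it understands the write variant. */
static bool may_write(uint64_t field, uint32_t code_variant)
{
	return code_variant >= write_variant(field);
}

/* On opening for write (under the OPEN_LOCK), lower the "understood"
 * half if our code's variant is older than the current lowest. */
static uint64_t note_opener(uint64_t field, uint32_t code_variant)
{
	if (code_variant < read_variant(field))
		field = ((uint64_t)write_variant(field) << 32) | code_variant;
	return field;
}
```

Newer code can then tell from a lowered lower half that a feature-unaware writer has touched the database.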
+
+2.16 Record Headers Are Not Expandable
+
+If we later want to add (say) checksums on keys and data, it
+would require another format change, which we'd like to avoid.
+
+2.16.1 Proposed Solution
+
+We often have extra padding at the tail of a record. If we ensure
+that the first byte (if any) of this padding is zero, we will
+have a way for future changes to detect code which doesn't
+understand a new format: the new code would write (say) a 1 at
+the tail, and thus if there is no tail or the first byte is 0, we
+would know the extension is not present on that record.
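As a sketch, assuming a record whose first used_len bytes are data and whose remaining bytes up to total_len are tail padding (the function and parameter names are invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Old code zeroes the first padding byte; new code writes a nonzero
 * marker there.  So: no tail at all, or a zero first tail byte, means
 * the extension is absent on this record. */
static bool has_extension(const uint8_t *rec, size_t used_len,
			  size_t total_len)
{
	if (total_len == used_len)
		return false;		/* no tail at all */
	return rec[used_len] != 0;
}
```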
+
+2.17 TDB Does Not Use Talloc
+
+Many users of TDB (particularly Samba) use the talloc allocator,
+and thus have to wrap TDB in a talloc context to use it
+conveniently.
+
+2.17.1 Proposed Solution
+
+The allocation within TDB is not complicated enough to justify
+the use of talloc, and I am reluctant to force another
+(excellent) library on TDB users. Nonetheless a compromise is
+possible. An attribute (see [attributes]) can be added later to
+tdb_open() to provide an alternate allocation mechanism,
+specifically for talloc but usable by any other allocator (which
+would ignore the “context” argument).
+
+This would form a talloc hierarchy as expected, but the caller
+would still have to attach a destructor to the tdb context
+returned from tdb_open to close it. All TDB_DATA fields would be
+children of the tdb_context, and the caller would still have to
+manage them (using talloc_free() or talloc_steal()).
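A hypothetical shape for such an attribute (all names here are invented; the eventual API may differ):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocation attribute: a talloc wrapper would pass its
 * parent as "context"; other allocators simply ignore it. */
struct tdb_attribute_allocator {
	void *(*alloc)(const void *context, size_t size);
	void (*free_fn)(void *ptr);
	void *context;
};

/* A plain-malloc implementation, which ignores the context argument. */
static void *malloc_alloc(const void *context, size_t size)
{
	(void)context;
	return malloc(size);
}
```

A talloc-based caller would instead point alloc at a wrapper around talloc_size() with the tdb_context as parent.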
+
3 Performance And Scalability Issues
3.1 <TDB_CLEAR_IF_FIRST-Imposes-Performance>TDB_CLEAR_IF_FIRST
impossible to know what the 'right' answer is at database
creation time.
-3.4.1 Proposed Solution
+3.4.1 <sub:Hash-Size-Solution>Proposed Solution
After comprehensive performance testing on various scalable hash
variants[footnote:
This was annoying because I was previously convinced that an
expanding tree of hashes would be very close to optimal.
], it became clear that it is hard to beat a straight linear hash
-table which doubles in size when it reaches saturation. There are
-three details which become important:
+table which doubles in size when it reaches saturation.
+
-1. On encountering a full bucket, we use the next bucket.
-2. Extra hash bits are stored with the offset, to reduce
- comparisons.
-3. A marker entry is used on deleting an entry.
-The doubling of the table must be done under a transaction; we
-will not reduce it on deletion, so it will be an unusual case. It
-will either be placed at the head (other entries will be moved
-out the way so we can expand). We could have a pointer in the
-header to the current hashtable location, but that pointer would
-have to be read frequently to check for hashtable moves.
-The locking for this is slightly more complex than the chained
-case; we currently have one lock per bucket, and that means we
-would need to expand the lock if we overflow to the next bucket.
-The frequency of such collisions will effect our locking
-heuristics: we can always lock more buckets than we need.
+Unfortunately, altering the hash table introduces serious
+locking complications: the entire table must be locked to enlarge
+it, and other processes might already be holding locks.
+Particularly insidious are insertions done under tdb_chainlock.
-One possible optimization is to only re-check the hash size on an
-insert or a lookup miss.
+Thus an expanding layered hash will be used: an array of hash
+groups, with each hash group exploding into pointers to lower
+hash groups once it fills, turning into a hash tree. This has
+implications for locking: we must lock the entire group in case
+we need to expand it, yet we don't know how deep the tree is at
+that point.
+
+Note that bits from the hash table entries should be stolen to
+hold more hash bits to reduce the penalty of collisions. We can
+use the otherwise-unused lower 3 bits. If we limit the size of
+the database to 64 exabytes, we can use the top 8 bits of the
+hash entry as well. These 11 bits would reduce false positives
+down to 1 in 2000, which is more than we need: we can use one of
+the bits to indicate that the extra hash bits are valid. This
+means we can choose not to re-hash all entries when we expand a
+hash group; simply use the next bits we need and mark them
+invalid.
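One illustrative packing (not the actual on-disk layout) steals the low 3 bits and the top byte of each 64-bit hash entry, using the top bit as the validity flag and the other 10 stolen bits as extra hash:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static uint64_t pack_entry(uint64_t offset, uint16_t extra, bool valid)
{
	uint64_t e = offset;			     /* 8-byte aligned, < 2^56 */
	e |= (uint64_t)(extra & 0x7);		     /* 3 hash bits in the low bits */
	e |= (uint64_t)((extra >> 3) & 0x7f) << 56;  /* 7 more in the top byte */
	if (valid)
		e |= 1ULL << 63;		     /* "extra bits valid" flag */
	return e;
}

static uint64_t entry_offset(uint64_t e)
{
	return e & 0x00fffffffffffff8ULL;
}

static bool entry_valid(uint64_t e)
{
	return e >> 63;
}

static uint16_t entry_extra(uint64_t e)
{
	return (uint16_t)((e & 0x7) | (((e >> 56) & 0x7f) << 3));
}
```

On expanding a group without re-hashing, entries would simply be stored with the valid flag clear.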
3.5 <TDB-Freelist-Is>TDB Freelist Is Highly Contended
number of free list entries we can use far fewer, say 1/32 of the
number of hash buckets.
+It seems tempting to try to reuse the hash implementation which
+we use for records here, but we have two ways of searching for
+free entries: for allocation we search by size (and possibly
+zone) which produces too many clashes for our hash table to
+handle well, and for coalescing we search by address. Thus an
+array of doubly-linked free lists seems preferable.
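A sketch of such an array of doubly-linked free lists, with in-memory nodes standing in for on-disk free records and an invented size-bucket scheme:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_FREE_LISTS 8

struct free_rec {
	uint64_t size;
	struct free_rec *prev, *next;
};

static struct free_rec *free_lists[NUM_FREE_LISTS];

/* Bucket by the highest set bit of the size (illustrative only). */
static unsigned size_bucket(uint64_t size)
{
	unsigned b = 0;
	while (size >>= 1)
		b++;
	return b < NUM_FREE_LISTS ? b : NUM_FREE_LISTS - 1;
}

static void free_list_add(struct free_rec *r)
{
	unsigned b = size_bucket(r->size);
	r->prev = NULL;
	r->next = free_lists[b];
	if (r->next)
		r->next->prev = r;
	free_lists[b] = r;
}

/* Doubly-linked, so coalescing can unlink a neighbour in O(1). */
static void free_list_del(struct free_rec *r)
{
	unsigned b = size_bucket(r->size);
	if (r->prev)
		r->prev->next = r->next;
	else
		free_lists[b] = r->next;
	if (r->next)
		r->next->prev = r->prev;
}
```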
+
There are various benefits in using per-size free lists (see [sub:TDB-Becomes-Fragmented]
) but it's not clear this would reduce contention in the common
case where all processes are allocating/freeing the same size.
be small, but it might be worth using one free entry to hold
pointers to the others for cache efficiency.
+<freelist-in-zone>If we want to avoid locking complexity
+(enlarging the free lists when we enlarge the file) we could
+place the array of free lists at the beginning of each zone. This
+means the existing free list arrays never move, but also that a
+record cannot be larger than a zone. That in turn implies that
+zones should be variable sized (say, powers of 2), which makes the
+question “what zone is this record in?” much harder (as does
+“pick a random zone”, though that's less common). It could be
+done with as few as 4 bits from the record header.[footnote:
+Using 2^{16+N*3} means N=0 gives a minimal 65536-byte zone, and
+N=15 gives the maximal 2^{61}-byte zone. Zones range in factor of
+8 steps.
+Given the zone size for the zone the current record is in, we can
+determine the start of the zone.
+]
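The footnote's encoding is easy to check (the helper names are ours):

```c
#include <assert.h>
#include <stdint.h>

/* 4 bits N in the record header select the zone size 2^(16 + N*3):
 * from a 65536-byte zone (N=0) up to a 2^61-byte zone (N=15). */
static uint64_t zone_size(unsigned n)
{
	return 1ULL << (16 + n * 3);
}

/* Knowing the zone size, the zone start is the offset rounded down. */
static uint64_t zone_start(uint64_t off, unsigned n)
{
	return off & ~(zone_size(n) - 1);
}
```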
+
3.6 <sub:TDB-Becomes-Fragmented>TDB Becomes Fragmented
Much of this is a result of allocation strategy[footnote:
this is diminishing returns after a handful of bits (at 10
bits, it reduces 99.9% of false memcmp). As an aside, as the
lower bits are already incorporated in the hash table
- resolution, the upper bits should be used here.
+ resolution, the upper bits should be used here. Note that it's
+ not clear that these bits will be a win, given the extra bits
+ in the hash table itself (see [sub:Hash-Size-Solution]).
5. 'magic' does not need to be enlarged: it currently reflects
one of 5 values (used, free, dead, recovery, and
uint64_t total_length;
+ uint64_t prev, next;
+
...
uint64_t tailer;
};
-
+We might want to take some bits from the used record's top_hash
+(and the free record which has 32 bits of padding to spare
+anyway) if we use variable sized zones. See [freelist-in-zone].
3.8 Transaction Commit Requires 4 fdatasync
3.9.1 Proposed Solution
-None. At some point you say “use a real database”.
+None. At some point you say “use a real database” (but see
+[replay-attribute]).
But as a thought experiment, if we implemented transactions to
only overwrite free entries (this is tricky: there must not be a
3.10.1 Proposed Solution
-We could solve a small part of the problem by providing read-only
-transactions. These would allow one write transaction to begin,
-but it could not commit until all r/o transactions are done. This
-would require a new RO_TRANSACTION_LOCK, which would be upgraded
-on commit.
+None (but see [replay-attribute]). We could solve a small part of
+the problem by providing read-only transactions. These would
+allow one write transaction to begin, but it could not commit
+until all r/o transactions are done. This would require a new
+RO_TRANSACTION_LOCK, which would be upgraded on commit.
3.11 Default Hash Function Is Suboptimal
levels, and apply the transactions until it encountered an
invalid checksum.
+3.15 Tracing Is Fragile, Replay Is External
+
+The current TDB has compile-time-enabled tracing code, but it
+often breaks as it is not enabled by default. In a similar way,
+the ctdb code has an external wrapper which does replay tracing
+so it can coordinate cluster-wide transactions.
+
+3.15.1 Proposed Solution<replay-attribute>
+
+Tridge points out that an attribute can be later added to
+tdb_open (see [attributes]) to provide replay/trace hooks, which
+could become the basis for this and future parallel transactions
+and snapshot support.
+