Rusty Russell, IBM Corporation
-1-September-2010
+14-September-2010
Abstract
argument. Additional arguments to open would require the
introduction of a tdb_open_ex2 call etc.
-2.1.1 Proposed Solution
+2.1.1 Proposed Solution<attributes>
tdb_open() will take a linked-list of attributes:
The aim is that building tdb with -DTDB_PTHREAD will result in a
pthread-safe version of the library, and otherwise no overhead
-will exist.
+will exist. Alternatively, a hooking mechanism similar to that
+proposed for [Proposed-Solution-locking-hook] could be used to
+enable pthread locking at runtime.
2.8 *_nonblock Functions And *_mark Functions Expose
Implementation
Remove TDB_CLEAR_IF_FIRST. Other workarounds are possible, but
see [TDB_CLEAR_IF_FIRST-Imposes-Performance].
+2.15 Extending The Header Is Difficult
+
+We have reserved (zeroed) words in the TDB header, which can be
+used for future features. If the future features are compulsory,
+the version number must be updated to prevent old code from
+accessing the database. But if the future feature is optional, we
+have no way of telling if older code is accessing the database or
+not.
+
+2.15.1 Proposed Solution
+
+The header should contain a “format variant” value (64-bit). This
+is divided into two 32-bit parts:
+
+1. The lower part reflects the format variant understood by code
+ accessing the database.
+
+2. The upper part reflects the format variant you must understand
+ to write to the database (otherwise you can only open for
+ reading).
+
+The latter field can only be written at creation time, the former
+should be written under the OPEN_LOCK when opening the database
+for writing, if the variant of the code is lower than the current
+lowest variant.
+
+This should allow backwards-compatible features to be added, and
+detection if older code (which doesn't understand the feature)
+writes to the database.
+
+2.16 Record Headers Are Not Expandible
+
+If we later want to add (say) checksums on keys and data, it
+would require another format change, which we'd like to avoid.
+
+2.16.1 Proposed Solution
+
+We often have extra padding at the tail of a record. If we ensure
+that the first byte (if any) of this padding is zero, we will
+have a way for future changes to detect code which doesn't
+understand a new format: the new code would write (say) a 1 at
+the tail, and thus if there is no tail or the first byte is 0, we
+would know the extension is not present on that record.
+
+2.17 TDB Does Not Use Talloc
+
+Many users of TDB (particularly Samba) use the talloc allocator,
+and thus have to wrap TDB in a talloc context to use it
+conveniently.
+
+2.17.1 Proposed Solution
+
+The allocation within TDB is not complicated enough to justify
+the use of talloc, and I am reluctant to force another
+(excellent) library on TDB users. Nonetheless a compromise is
+possible. An attribute (see [attributes]) can be added later to
+tdb_open() to provide an alternate allocation mechanism,
+specifically for talloc but usable by any other allocator (which
+would ignore the “context” argument).
+
+This would form a talloc heirarchy as expected, but the caller
+would still have to attach a destructor to the tdb context
+returned from tdb_open to close it. All TDB_DATA fields would be
+children of the tdb_context, and the caller would still have to
+manage them (using talloc_free() or talloc_steal()).
+
3 Performance And Scalability Issues
3.1 <TDB_CLEAR_IF_FIRST-Imposes-Performance>TDB_CLEAR_IF_FIRST
random zone”, but that's less common). It could be done with as
few as 4 bits from the record header.[footnote:
Using 2^{16+N*3}means 0 gives a minimal 65536-byte zone, 15 gives
-the maximal 2^{61} byte zone. Zones range in factor of 8 steps.
+the maximal 2^{61} byte zone. Zones range in factor of 8 steps.
+Given the zone size for the zone the current record is in, we can
+determine the start of the zone.
]
3.6 <sub:TDB-Becomes-Fragmented>TDB Becomes Fragmented
3.9.1 Proposed Solution
-None. At some point you say “use a real database”.
+None. At some point you say “use a real database” (but see [replay-attribute]
+).
But as a thought experiment, if we implemented transactions to
only overwrite free entries (this is tricky: there must not be a
3.10.1 Proposed Solution
-We could solve a small part of the problem by providing read-only
-transactions. These would allow one write transaction to begin,
-but it could not commit until all r/o transactions are done. This
-would require a new RO_TRANSACTION_LOCK, which would be upgraded
-on commit.
+None (but see [replay-attribute]). We could solve a small part of
+the problem by providing read-only transactions. These would
+allow one write transaction to begin, but it could not commit
+until all r/o transactions are done. This would require a new
+RO_TRANSACTION_LOCK, which would be upgraded on commit.
3.11 Default Hash Function Is Suboptimal
levels, and apply the transactions until it encountered an
invalid checksum.
+3.15 Tracing Is Fragile, Replay Is External
+
+The current TDB has compile-time-enabled tracing code, but it
+often breaks as it is not enabled by default. In a similar way,
+the ctdb code has an external wrapper which does replay tracing
+so it can coordinate cluster-wide transactions.
+
+3.15.1 Proposed Solution<replay-attribute>
+
+Tridge points out that an attribute can be later added to
+tdb_open (see [attributes]) to provide replay/trace hooks, which
+could become the basis for this and future parallel transactions
+and snapshot support.
+