X-Git-Url: https://git.ozlabs.org/?p=ccan;a=blobdiff_plain;f=ccan%2Ftdb2%2Fdoc%2Fdesign.txt;h=233a43abe947e561b0d5e53adcd5184d07125f17;hp=88334a8a4957f53b7d03fbc3f656f673113ded55;hb=d1cea3ebf96f61b5bbac1e74975700770e06add6;hpb=95458bafc9dc99ac8fcd68aa8f48a9fc564e6a31

diff --git a/ccan/tdb2/doc/design.txt b/ccan/tdb2/doc/design.txt
index 88334a8a..233a43ab 100644
--- a/ccan/tdb2/doc/design.txt
+++ b/ccan/tdb2/doc/design.txt
@@ -2,7 +2,7 @@ TDB2: A Redesigning The Trivial DataBase
 
 Rusty Russell, IBM Corporation
 
-1-September-2010
+14-September-2010
 
 Abstract
 
@@ -74,7 +74,7 @@ optional hashing function and an optional logging function
 argument. Additional arguments to open would require the
 introduction of a tdb_open_ex2 call etc.
 
-2.1.1 Proposed Solution
+2.1.1 Proposed Solution
 
 tdb_open() will take a linked-list of attributes:
 
@@ -277,7 +277,9 @@ maintained.
 
 The aim is that building tdb with -DTDB_PTHREAD will result in a
 pthread-safe version of the library, and otherwise no overhead
-will exist.
+will exist. Alternatively, a hooking mechanism similar to that
+proposed for [Proposed-Solution-locking-hook] could be used to
+enable pthread locking at runtime.
 
 2.8 *_nonblock Functions And *_mark Functions Expose Implementation
 
@@ -473,6 +475,72 @@ it alone has opened the TDB and will erase it.
 Remove TDB_CLEAR_IF_FIRST. Other workarounds are possible, but
 see [TDB_CLEAR_IF_FIRST-Imposes-Performance].
 
+2.15 Extending The Header Is Difficult
+
+We have reserved (zeroed) words in the TDB header, which can be
+used for future features. If the future features are compulsory,
+the version number must be updated to prevent old code from
+accessing the database. But if the future feature is optional, we
+have no way of telling if older code is accessing the database or
+not.
+
+2.15.1 Proposed Solution
+
+The header should contain a “format variant” value (64-bit). This
+is divided into two 32-bit parts:
+
+1. The lower part reflects the format variant understood by code
+   accessing the database.
+
+2. The upper part reflects the format variant you must understand
+   to write to the database (otherwise you can only open for
+   reading).
+
+The latter field can only be written at creation time, the former
+should be written under the OPEN_LOCK when opening the database
+for writing, if the variant of the code is lower than the current
+lowest variant.
+
+This should allow backwards-compatible features to be added, and
+detection if older code (which doesn't understand the feature)
+writes to the database.
+
+2.16 Record Headers Are Not Expandible
+
+If we later want to add (say) checksums on keys and data, it
+would require another format change, which we'd like to avoid.
+
+2.16.1 Proposed Solution
+
+We often have extra padding at the tail of a record. If we ensure
+that the first byte (if any) of this padding is zero, we will
+have a way for future changes to detect code which doesn't
+understand a new format: the new code would write (say) a 1 at
+the tail, and thus if there is no tail or the first byte is 0, we
+would know the extension is not present on that record.
+
+2.17 TDB Does Not Use Talloc
+
+Many users of TDB (particularly Samba) use the talloc allocator,
+and thus have to wrap TDB in a talloc context to use it
+conveniently.
+
+2.17.1 Proposed Solution
+
+The allocation within TDB is not complicated enough to justify
+the use of talloc, and I am reluctant to force another
+(excellent) library on TDB users. Nonetheless a compromise is
+possible. An attribute (see [attributes]) can be added later to
+tdb_open() to provide an alternate allocation mechanism,
+specifically for talloc but usable by any other allocator (which
+would ignore the “context” argument).
+
+This would form a talloc hierarchy as expected, but the caller
+would still have to attach a destructor to the tdb context
+returned from tdb_open to close it. All TDB_DATA fields would be
+children of the tdb_context, and the caller would still have to
+manage them (using talloc_free() or talloc_steal()).
+
 3 Performance And Scalability Issues
 
 3.1 TDB_CLEAR_IF_FIRST
@@ -744,7 +812,9 @@ question “what zone is this record in?” much harder (and “pick a
 random zone”, but that's less common). It could be done with as
 few as 4 bits from the record header.[footnote:
 Using 2^{16+N*3}means 0 gives a minimal 65536-byte zone, 15 gives
-the maximal 2^{61} byte zone. Zones range in factor of 8 steps.
+the maximal 2^{61} byte zone. Zones range in factor of 8 steps.
+Given the zone size for the zone the current record is in, we can
+determine the start of the zone.
 ]
 
 3.6 TDB Becomes Fragmented
@@ -963,7 +1033,8 @@ we need only check for recovery if this is set.
 
 3.9.1 Proposed Solution
 
-None. At some point you say “use a real database”.
+None. At some point you say “use a real database” (but see [replay-attribute]
+).
 
 But as a thought experiment, if we implemented transactions to
 only overwrite free entries (this is tricky: there must not be a
@@ -992,11 +1063,11 @@ failed.
 
 3.10.1 Proposed Solution
 
-We could solve a small part of the problem by providing read-only
-transactions. These would allow one write transaction to begin,
-but it could not commit until all r/o transactions are done. This
-would require a new RO_TRANSACTION_LOCK, which would be upgraded
-on commit.
+None (but see [replay-attribute]). We could solve a small part of
+the problem by providing read-only transactions. These would
+allow one write transaction to begin, but it could not commit
+until all r/o transactions are done. This would require a new
+RO_TRANSACTION_LOCK, which would be upgraded on commit.
 
 3.11 Default Hash Function Is Suboptimal
 
@@ -1091,3 +1162,17 @@ filled). On crash, tdb_open() would examine the array of top
 levels, and apply the transactions until it encountered an
 invalid checksum.
 
+3.15 Tracing Is Fragile, Replay Is External
+
+The current TDB has compile-time-enabled tracing code, but it
+often breaks as it is not enabled by default. In a similar way,
+the ctdb code has an external wrapper which does replay tracing
+so it can coordinate cluster-wide transactions.
+
+3.15.1 Proposed Solution
+
+Tridge points out that an attribute can be later added to
+tdb_open (see [attributes]) to provide replay/trace hooks, which
+could become the basis for this and future parallel transactions
+and snapshot support.
+
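
The 64-bit “format variant” proposed in section 2.15.1 of this patch can be illustrated with a small C sketch. This is only one reading of the text, not code from TDB2: the function names are hypothetical, and the sketch assumes that a larger variant number means a newer format and that the caller already holds the OPEN_LOCK when updating the lower half.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers illustrating section 2.15.1; none of these
 * names exist in TDB2. */

/* Lower 32 bits: lowest format variant of any code that has opened
 * the database for writing. */
static uint32_t variant_lowest_writer(uint64_t header_variant)
{
	return (uint32_t)(header_variant & 0xFFFFFFFFu);
}

/* Upper 32 bits: variant a program must understand in order to
 * write. Set once, at creation time. */
static uint32_t variant_required(uint64_t header_variant)
{
	return (uint32_t)(header_variant >> 32);
}

/* Code that does not understand the required variant may only open
 * the database read-only. */
static bool can_write(uint64_t header_variant, uint32_t code_variant)
{
	return code_variant >= variant_required(header_variant);
}

/* Value to store in the header when opening for writing (assumed to
 * be called with the OPEN_LOCK held): the lower half is lowered if
 * this code is older than every previous writer, and the upper half
 * is never touched. */
static uint64_t variant_after_open(uint64_t header_variant,
				   uint32_t code_variant)
{
	if (code_variant < variant_lowest_writer(header_variant))
		return (header_variant & 0xFFFFFFFF00000000ULL)
			| code_variant;
	return header_variant;
}
```

Under these assumptions, newer code can compare variant_lowest_writer() against the variant that introduced an optional feature to tell whether older code may have written to the database without maintaining it.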