From: Rusty Russell Date: Tue, 14 Sep 2010 00:36:04 +0000 (+0930) Subject: tdb2: update design doc. X-Git-Url: https://git.ozlabs.org/?p=ccan;a=commitdiff_plain;h=b8d05b195bfa10cb2a5b21985536ea45350029d5 tdb2: update design doc. --- diff --git a/ccan/tdb2/doc/design.lyx b/ccan/tdb2/doc/design.lyx index 8f061a7c..ca17f8fe 100644 --- a/ccan/tdb2/doc/design.lyx +++ b/ccan/tdb2/doc/design.lyx @@ -53,8 +53,8 @@ Rusty Russell, IBM Corporation \change_deleted 0 1283307542 26-July -\change_inserted 0 1284016854 -9-September +\change_inserted 0 1284423485 +14-September \change_unchanged -2010 \end_layout @@ -476,6 +476,17 @@ The tdb_open() call was expanded to tdb_open_ex(), which added an optional \begin_layout Subsubsection Proposed Solution +\change_inserted 0 1284422789 + +\begin_inset CommandInset label +LatexCommand label +name "attributes" + +\end_inset + + +\change_unchanged + \end_layout \begin_layout Standard @@ -1289,13 +1300,69 @@ Proposed Solution \begin_layout Standard -\change_inserted 0 1284016847 +\change_inserted 0 1284422552 We often have extra padding at the tail of a record. If we ensure that the first byte (if any) of this padding is zero, we will have a way for future changes to detect code which doesn't understand a new format: the new code would write (say) a 1 at the tail, and thus if there is no tail or the first byte is 0, we would know the extension is not present on that record. +\end_layout + +\begin_layout Subsection + +\change_inserted 0 1284422568 +TDB Does Not Use Talloc +\end_layout + +\begin_layout Standard + +\change_inserted 0 1284422646 +Many users of TDB (particularly Samba) use the talloc allocator, and thus + have to wrap TDB in a talloc context to use it conveniently. 
+\end_layout + +\begin_layout Subsubsection + +\change_inserted 0 1284422656 +Proposed Solution +\end_layout + +\begin_layout Standard + +\change_inserted 0 1284423065 +The allocation within TDB is not complicated enough to justify the use of + talloc, and I am reluctant to force another (excellent) library on TDB + users. + Nonetheless a compromise is possible. + An attribute (see +\begin_inset CommandInset ref +LatexCommand ref +reference "attributes" + +\end_inset + +) can be added later to tdb_open() to provide an alternate allocation mechanism, + specifically for talloc but usable by any other allocator (which would + ignore the +\begin_inset Quotes eld +\end_inset + +context +\begin_inset Quotes erd +\end_inset + + argument). +\end_layout + +\begin_layout Standard + +\change_inserted 0 1284423042 +This would form a talloc heirarchy as expected, but the caller would still + have to attach a destructor to the tdb context returned from tdb_open to + close it. + All TDB_DATA fields would be children of the tdb_context, and the caller + would still have to manage them (using talloc_free() or talloc_steal()). \change_unchanged \end_layout @@ -1875,7 +1942,7 @@ status open \begin_layout Plain Layout -\change_inserted 0 1283310945 +\change_inserted 0 1284424151 Using \begin_inset Formula $2^{16+N*3}$ \end_inset @@ -1886,6 +1953,8 @@ means 0 gives a minimal 65536-byte zone, 15 gives the maximal byte zone. Zones range in factor of 8 steps. + Given the zone size for the zone the current record is in, we can determine + the start of the zone. \change_unchanged \end_layout @@ -2330,6 +2399,8 @@ TDB Does Not Have Snapshot Support \begin_layout Subsubsection Proposed Solution +\change_deleted 0 1284423472 + \end_layout \begin_layout Standard @@ -2342,7 +2413,23 @@ use a real database \begin_inset Quotes erd \end_inset + +\change_inserted 0 1284423891 + +\change_deleted 0 1284423891 . 
+ +\change_inserted 0 1284423901 + (but see +\begin_inset CommandInset ref +LatexCommand ref +reference "replay-attribute" + +\end_inset + +). +\change_unchanged + \end_layout \begin_layout Standard @@ -2365,6 +2452,8 @@ This would not allow arbitrary changes to the database, such as tdb_repack \begin_layout Standard We could then implement snapshots using a similar method, using multiple different hash tables/free tables. +\change_inserted 0 1284423495 + \end_layout \begin_layout Subsection @@ -2384,6 +2473,18 @@ Proposed Solution \end_layout \begin_layout Standard + +\change_inserted 0 1284424201 +None (but see +\begin_inset CommandInset ref +LatexCommand ref +reference "replay-attribute" + +\end_inset + +). + +\change_unchanged We could solve a small part of the problem by providing read-only transactions. These would allow one write transaction to begin, but it could not commit until all r/o transactions are done. @@ -2569,6 +2670,53 @@ At some later point, a sync would allow recovery of the old data into the free lists (perhaps when the array of top-level pointers filled). On crash, tdb_open() would examine the array of top levels, and apply the transactions until it encountered an invalid checksum. +\change_inserted 0 1284423555 + +\end_layout + +\begin_layout Subsection + +\change_inserted 0 1284423617 +Tracing Is Fragile, Replay Is External +\end_layout + +\begin_layout Standard + +\change_inserted 0 1284423719 +The current TDB has compile-time-enabled tracing code, but it often breaks + as it is not enabled by default. + In a similar way, the ctdb code has an external wrapper which does replay + tracing so it can coordinate cluster-wide transactions. 
+\end_layout + +\begin_layout Subsubsection + +\change_inserted 0 1284423864 +Proposed Solution +\begin_inset CommandInset label +LatexCommand label +name "replay-attribute" + +\end_inset + + +\end_layout + +\begin_layout Standard + +\change_inserted 0 1284423850 +Tridge points out that an attribute can be later added to tdb_open (see + +\begin_inset CommandInset ref +LatexCommand ref +reference "attributes" + +\end_inset + +) to provide replay/trace hooks, which could become the basis for this and + future parallel transactions and snapshot support. +\change_unchanged + \end_layout \end_body diff --git a/ccan/tdb2/doc/design.lyx,v b/ccan/tdb2/doc/design.lyx,v index 54005d48..70fe70e2 100644 --- a/ccan/tdb2/doc/design.lyx,v +++ b/ccan/tdb2/doc/design.lyx,v @@ -1,10 +1,15 @@ -head 1.9; +head 1.10; access; symbols; locks; strict; comment @# @; +1.10 +date 2010.09.14.00.33.57; author rusty; state Exp; +branches; +next 1.9; + 1.9 date 2010.09.09.07.25.12; author rusty; state Exp; branches; @@ -56,9 +61,9 @@ desc @ -1.9 +1.10 log -@Extension mechanism. +@Tracing attribute, talloc support. @ text @#LyX 1.6.5 created this file. For more info see http://www.lyx.org/ @@ -116,8 +121,8 @@ Rusty Russell, IBM Corporation \change_deleted 0 1283307542 26-July -\change_inserted 0 1284016854 -9-September +\change_inserted 0 1284423485 +14-September \change_unchanged -2010 \end_layout @@ -539,6 +544,17 @@ The tdb_open() call was expanded to tdb_open_ex(), which added an optional \begin_layout Subsubsection Proposed Solution +\change_inserted 0 1284422789 + +\begin_inset CommandInset label +LatexCommand label +name "attributes" + +\end_inset + + +\change_unchanged + \end_layout \begin_layout Standard @@ -1352,13 +1368,69 @@ Proposed Solution \begin_layout Standard -\change_inserted 0 1284016847 +\change_inserted 0 1284422552 We often have extra padding at the tail of a record. 
If we ensure that the first byte (if any) of this padding is zero, we will have a way for future changes to detect code which doesn't understand a new format: the new code would write (say) a 1 at the tail, and thus if there is no tail or the first byte is 0, we would know the extension is not present on that record. +\end_layout + +\begin_layout Subsection + +\change_inserted 0 1284422568 +TDB Does Not Use Talloc +\end_layout + +\begin_layout Standard + +\change_inserted 0 1284422646 +Many users of TDB (particularly Samba) use the talloc allocator, and thus + have to wrap TDB in a talloc context to use it conveniently. +\end_layout + +\begin_layout Subsubsection + +\change_inserted 0 1284422656 +Proposed Solution +\end_layout + +\begin_layout Standard + +\change_inserted 0 1284423065 +The allocation within TDB is not complicated enough to justify the use of + talloc, and I am reluctant to force another (excellent) library on TDB + users. + Nonetheless a compromise is possible. + An attribute (see +\begin_inset CommandInset ref +LatexCommand ref +reference "attributes" + +\end_inset + +) can be added later to tdb_open() to provide an alternate allocation mechanism, + specifically for talloc but usable by any other allocator (which would + ignore the +\begin_inset Quotes eld +\end_inset + +context +\begin_inset Quotes erd +\end_inset + + argument). +\end_layout + +\begin_layout Standard + +\change_inserted 0 1284423042 +This would form a talloc heirarchy as expected, but the caller would still + have to attach a destructor to the tdb context returned from tdb_open to + close it. + All TDB_DATA fields would be children of the tdb_context, and the caller + would still have to manage them (using talloc_free() or talloc_steal()). 
\change_unchanged \end_layout @@ -1938,7 +2010,7 @@ status open \begin_layout Plain Layout -\change_inserted 0 1283310945 +\change_inserted 0 1284424151 Using \begin_inset Formula $2^{16+N*3}$ \end_inset @@ -1949,6 +2021,8 @@ means 0 gives a minimal 65536-byte zone, 15 gives the maximal byte zone. Zones range in factor of 8 steps. + Given the zone size for the zone the current record is in, we can determine + the start of the zone. \change_unchanged \end_layout @@ -2393,6 +2467,8 @@ TDB Does Not Have Snapshot Support \begin_layout Subsubsection Proposed Solution +\change_deleted 0 1284423472 + \end_layout \begin_layout Standard @@ -2405,7 +2481,23 @@ use a real database \begin_inset Quotes erd \end_inset + +\change_inserted 0 1284423891 + +\change_deleted 0 1284423891 . + +\change_inserted 0 1284423901 + (but see +\begin_inset CommandInset ref +LatexCommand ref +reference "replay-attribute" + +\end_inset + +). +\change_unchanged + \end_layout \begin_layout Standard @@ -2428,6 +2520,8 @@ This would not allow arbitrary changes to the database, such as tdb_repack \begin_layout Standard We could then implement snapshots using a similar method, using multiple different hash tables/free tables. +\change_inserted 0 1284423495 + \end_layout \begin_layout Subsection @@ -2447,6 +2541,18 @@ Proposed Solution \end_layout \begin_layout Standard + +\change_inserted 0 1284424201 +None (but see +\begin_inset CommandInset ref +LatexCommand ref +reference "replay-attribute" + +\end_inset + +). + +\change_unchanged We could solve a small part of the problem by providing read-only transactions. These would allow one write transaction to begin, but it could not commit until all r/o transactions are done. @@ -2632,6 +2738,53 @@ At some later point, a sync would allow recovery of the old data into the free lists (perhaps when the array of top-level pointers filled). 
On crash, tdb_open() would examine the array of top levels, and apply the transactions until it encountered an invalid checksum. +\change_inserted 0 1284423555 + +\end_layout + +\begin_layout Subsection + +\change_inserted 0 1284423617 +Tracing Is Fragile, Replay Is External +\end_layout + +\begin_layout Standard + +\change_inserted 0 1284423719 +The current TDB has compile-time-enabled tracing code, but it often breaks + as it is not enabled by default. + In a similar way, the ctdb code has an external wrapper which does replay + tracing so it can coordinate cluster-wide transactions. +\end_layout + +\begin_layout Subsubsection + +\change_inserted 0 1284423864 +Proposed Solution +\begin_inset CommandInset label +LatexCommand label +name "replay-attribute" + +\end_inset + + +\end_layout + +\begin_layout Standard + +\change_inserted 0 1284423850 +Tridge points out that an attribute can be later added to tdb_open (see + +\begin_inset CommandInset ref +LatexCommand ref +reference "attributes" + +\end_inset + +) to provide replay/trace hooks, which could become the basis for this and + future parallel transactions and snapshot support. +\change_unchanged + \end_layout \end_body @@ -2639,6 +2792,33 @@ At some later point, a sync would allow recovery of the old data into the @ +1.9 +log +@Extension mechanism. +@ +text +@d56 2 +a57 2 +\change_inserted 0 1284016854 +9-September +d479 11 +d1303 1 +a1303 1 +\change_inserted 0 1284016847 +d1310 56 +d1945 1 +a1945 1 +\change_inserted 0 1283310945 +d1956 2 +d2402 2 +d2416 4 +d2421 12 +d2455 2 +d2476 12 +d2673 47 +@ + + 1.8 log @Remove bogus footnote diff --git a/ccan/tdb2/doc/design.txt b/ccan/tdb2/doc/design.txt index 967c0b09..233a43ab 100644 --- a/ccan/tdb2/doc/design.txt +++ b/ccan/tdb2/doc/design.txt @@ -2,7 +2,7 @@ TDB2: A Redesigning The Trivial DataBase Rusty Russell, IBM Corporation -9-September-2010 +14-September-2010 Abstract @@ -74,7 +74,7 @@ optional hashing function and an optional logging function argument. 
Additional arguments to open would require the introduction of a tdb_open_ex2 call etc. -2.1.1 Proposed Solution +2.1.1 Proposed Solution tdb_open() will take a linked-list of attributes: @@ -519,6 +519,28 @@ understand a new format: the new code would write (say) a 1 at the tail, and thus if there is no tail or the first byte is 0, we would know the extension is not present on that record. +2.17 TDB Does Not Use Talloc + +Many users of TDB (particularly Samba) use the talloc allocator, +and thus have to wrap TDB in a talloc context to use it +conveniently. + +2.17.1 Proposed Solution + +The allocation within TDB is not complicated enough to justify +the use of talloc, and I am reluctant to force another +(excellent) library on TDB users. Nonetheless a compromise is +possible. An attribute (see [attributes]) can be added later to +tdb_open() to provide an alternate allocation mechanism, +specifically for talloc but usable by any other allocator (which +would ignore the “context” argument). + +This would form a talloc heirarchy as expected, but the caller +would still have to attach a destructor to the tdb context +returned from tdb_open to close it. All TDB_DATA fields would be +children of the tdb_context, and the caller would still have to +manage them (using talloc_free() or talloc_steal()). + 3 Performance And Scalability Issues 3.1 TDB_CLEAR_IF_FIRST @@ -790,7 +812,9 @@ question “what zone is this record in?” much harder (and “pick a random zone”, but that's less common). It could be done with as few as 4 bits from the record header.[footnote: Using 2^{16+N*3}means 0 gives a minimal 65536-byte zone, 15 gives -the maximal 2^{61} byte zone. Zones range in factor of 8 steps. +the maximal 2^{61} byte zone. Zones range in factor of 8 steps. +Given the zone size for the zone the current record is in, we can +determine the start of the zone. ] 3.6 TDB Becomes Fragmented @@ -1009,7 +1033,8 @@ we need only check for recovery if this is set. 
3.9.1 Proposed Solution -None. At some point you say “use a real database”. +None. At some point you say “use a real database” (but see [replay-attribute] +). But as a thought experiment, if we implemented transactions to only overwrite free entries (this is tricky: there must not be a @@ -1038,11 +1063,11 @@ failed. 3.10.1 Proposed Solution -We could solve a small part of the problem by providing read-only -transactions. These would allow one write transaction to begin, -but it could not commit until all r/o transactions are done. This -would require a new RO_TRANSACTION_LOCK, which would be upgraded -on commit. +None (but see [replay-attribute]). We could solve a small part of +the problem by providing read-only transactions. These would +allow one write transaction to begin, but it could not commit +until all r/o transactions are done. This would require a new +RO_TRANSACTION_LOCK, which would be upgraded on commit. 3.11 Default Hash Function Is Suboptimal @@ -1137,3 +1162,17 @@ filled). On crash, tdb_open() would examine the array of top levels, and apply the transactions until it encountered an invalid checksum. +3.15 Tracing Is Fragile, Replay Is External + +The current TDB has compile-time-enabled tracing code, but it +often breaks as it is not enabled by default. In a similar way, +the ctdb code has an external wrapper which does replay tracing +so it can coordinate cluster-wide transactions. + +3.15.1 Proposed Solution + +Tridge points out that an attribute can be later added to +tdb_open (see [attributes]) to provide replay/trace hooks, which +could become the basis for this and future parallel transactions +and snapshot support. +
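Sections 2.17.1 and 3.15.1 of the updated design both lean on the tdb_open() attribute list to carry an alternate allocator (with a "context" argument) and replay/trace hooks. The following is a minimal sketch of how such a list might look, not the tdb2 API: every identifier in it (the sketch_* types and functions, plain_alloc, plain_release, count_op) is hypothetical, and sketch_open() is only a stand-in that allocates one buffer and reports two operations to the trace hook.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* All names below are hypothetical; none of this is the tdb2 API. */
enum sketch_attr_type { SKETCH_ATTR_ALLOCATOR, SKETCH_ATTR_TRACE };

/* Common header: every attribute starts with this, forming a linked list. */
struct sketch_attr {
	enum sketch_attr_type type;
	const struct sketch_attr *next;
};

/* Allocation attribute: for talloc users, context would be the parent. */
struct sketch_attr_allocator {
	struct sketch_attr base;
	void *(*alloc)(void *context, size_t len);
	void (*release)(void *context, void *ptr);
	void *context;
};

/* Trace attribute: the hook is called with the name of each operation. */
struct sketch_attr_trace {
	struct sketch_attr base;
	void (*hook)(void *priv, const char *op);
	void *priv;
};

/* Walk the list and return the first attribute of the given type, or NULL. */
static const struct sketch_attr *
sketch_find_attr(const struct sketch_attr *list, enum sketch_attr_type type)
{
	for (; list; list = list->next)
		if (list->type == type)
			return list;
	return NULL;
}

/* A malloc-backed allocator that ignores the context argument. */
static void *plain_alloc(void *context, size_t len)
{
	(void)context;
	return malloc(len);
}

static void plain_release(void *context, void *ptr)
{
	(void)context;
	free(ptr);
}

/* A trace hook that only counts the operations it sees. */
static void count_op(void *priv, const char *op)
{
	(void)op;
	++*(unsigned int *)priv;
}

/* Stand-in for an open call: uses the allocator attribute if present
 * (falling back to malloc/free) and reports "open" and "close" to the
 * trace attribute if present.  Returns 0, or -1 if allocation fails. */
static int sketch_open(const struct sketch_attr *attrs)
{
	const struct sketch_attr_allocator *a =
		(const struct sketch_attr_allocator *)
		sketch_find_attr(attrs, SKETCH_ATTR_ALLOCATOR);
	const struct sketch_attr_trace *t =
		(const struct sketch_attr_trace *)
		sketch_find_attr(attrs, SKETCH_ATTR_TRACE);
	void *buf = a ? a->alloc(a->context, 64) : malloc(64);

	if (!buf)
		return -1;
	if (t)
		t->hook(t->priv, "open");
	if (a)
		a->release(a->context, buf);
	else
		free(buf);
	if (t)
		t->hook(t->priv, "close");
	return 0;
}

/* Chain an allocator attribute and a trace attribute, then "open".
 * Returns the number of operations the trace hook was told about. */
static unsigned int sketch_demo(void)
{
	unsigned int ops = 0;
	struct sketch_attr_trace trace = {
		{ SKETCH_ATTR_TRACE, NULL }, count_op, &ops
	};
	struct sketch_attr_allocator alloc = {
		{ SKETCH_ATTR_ALLOCATOR, &trace.base },
		plain_alloc, plain_release, NULL
	};

	if (sketch_open(&alloc.base) != 0)
		return 0;
	return ops;
}
```

Because each attribute embeds the common header as its first member, new attribute types can be added to the list without changing the open call's signature, which is the property the design relies on. An allocator that has no use for a parent, such as the malloc wrapper here, simply ignores the context argument.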