1 #LyX 1.6.5 created this file. For more info see http://www.lyx.org/
6 \use_default_options true
11 \font_typewriter default
12 \font_default_family default
19 \paperfontsize default
27 \paperorientation portrait
30 \paragraph_separation indent
32 \quotes_language english
35 \paperpagestyle default
36 \tracking_changes true
38 \author "Rusty Russell,,,"
45 TDB2: A Redesigning The Trivial DataBase
49 Rusty Russell, IBM Corporation
54 \change_deleted 0 1283307542
56 \change_inserted 0 1284423485
62 \begin_layout Abstract
63 The Trivial DataBase on-disk format is 32 bits; with usage cases heading
64 towards the 4G limit, that must change.
65 This required breakage provides an opportunity to revisit TDB's other design
66 decisions and reassess them.
73 \begin_layout Standard
74 The Trivial DataBase was originally written by Andrew Tridgell as a simple
75 key/data pair storage system with the same API as dbm, but allowing multiple
76 readers and writers while being small enough (< 1000 lines of C) to include
78 The simple design created in 1999 has proven surprisingly robust and performant
79 , used in Samba versions 3 and 4 as well as numerous other projects.
80 Its useful life was greatly increased by the (backwards-compatible!) addition
81 of transaction support in 2005.
84 \begin_layout Standard
85 The wider variety and greater demands of TDB-using code has lead to some
86 organic growth of the API, as well as some compromises on the implementation.
87 None of these, by themselves, are seen as show-stoppers, but the cumulative
88 effect is to a loss of elegance over the initial, simple TDB implementation.
89 Here is a table of the approximate number of lines of implementation code
90 and number of API functions at the end of each year:
93 \begin_layout Standard
95 <lyxtabular version="3" rows="12" columns="3">
97 <column alignment="center" valignment="top" width="0">
98 <column alignment="center" valignment="top" width="0">
99 <column alignment="center" valignment="top" width="0">
101 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
104 \begin_layout Plain Layout
110 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
113 \begin_layout Plain Layout
119 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
122 \begin_layout Plain Layout
123 Lines of C Code Implementation
130 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
133 \begin_layout Plain Layout
139 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
142 \begin_layout Plain Layout
148 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
151 \begin_layout Plain Layout
159 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
162 \begin_layout Plain Layout
168 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
171 \begin_layout Plain Layout
177 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
180 \begin_layout Plain Layout
188 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
191 \begin_layout Plain Layout
197 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
200 \begin_layout Plain Layout
206 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
209 \begin_layout Plain Layout
217 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
220 \begin_layout Plain Layout
226 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
229 \begin_layout Plain Layout
235 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
238 \begin_layout Plain Layout
246 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
249 \begin_layout Plain Layout
255 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
258 \begin_layout Plain Layout
264 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
267 \begin_layout Plain Layout
275 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
278 \begin_layout Plain Layout
284 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
287 \begin_layout Plain Layout
293 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
296 \begin_layout Plain Layout
304 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
307 \begin_layout Plain Layout
313 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
316 \begin_layout Plain Layout
322 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
325 \begin_layout Plain Layout
333 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
336 \begin_layout Plain Layout
342 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
345 \begin_layout Plain Layout
351 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
354 \begin_layout Plain Layout
362 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
365 \begin_layout Plain Layout
371 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
374 \begin_layout Plain Layout
380 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
383 \begin_layout Plain Layout
391 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
394 \begin_layout Plain Layout
400 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
403 \begin_layout Plain Layout
409 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
412 \begin_layout Plain Layout
420 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
423 \begin_layout Plain Layout
429 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
432 \begin_layout Plain Layout
438 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
441 \begin_layout Plain Layout
455 \begin_layout Standard
456 This review is an attempt to catalog and address all the known issues with
457 TDB and create solutions which address the problems without significantly
458 increasing complexity; all involved are far too aware of the dangers of
459 second system syndrome in rewriting a successful project like this.
462 \begin_layout Section
466 \begin_layout Subsection
467 tdb_open_ex Is Not Expandable
470 \begin_layout Standard
471 The tdb_open() call was expanded to tdb_open_ex(), which added an optional
472 hashing function and an optional logging function argument.
473 Additional arguments to open would require the introduction of a tdb_open_ex2
477 \begin_layout Subsubsection
479 \change_inserted 0 1284422789
481 \begin_inset CommandInset label
492 \begin_layout Standard
493 tdb_open() will take a linked-list of attributes:
496 \begin_layout LyX-Code
500 \begin_layout LyX-Code
501 TDB_ATTRIBUTE_LOG = 0,
504 \begin_layout LyX-Code
505 TDB_ATTRIBUTE_HASH = 1
508 \begin_layout LyX-Code
512 \begin_layout LyX-Code
513 struct tdb_attribute_base {
516 \begin_layout LyX-Code
517 enum tdb_attribute attr;
520 \begin_layout LyX-Code
521 union tdb_attribute *next;
524 \begin_layout LyX-Code
528 \begin_layout LyX-Code
529 struct tdb_attribute_log {
532 \begin_layout LyX-Code
533 struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_LOG */
536 \begin_layout LyX-Code
540 \begin_layout LyX-Code
544 \begin_layout LyX-Code
548 \begin_layout LyX-Code
549 struct tdb_attribute_hash {
552 \begin_layout LyX-Code
553 struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_HASH */
556 \begin_layout LyX-Code
557 tdb_hash_func hash_fn;
560 \begin_layout LyX-Code
564 \begin_layout LyX-Code
568 \begin_layout LyX-Code
569 union tdb_attribute {
572 \begin_layout LyX-Code
573 struct tdb_attribute_base base;
576 \begin_layout LyX-Code
577 struct tdb_attribute_log log;
580 \begin_layout LyX-Code
581 struct tdb_attribute_hash hash;
584 \begin_layout LyX-Code
588 \begin_layout Standard
589 This allows future attributes to be added, even if this expands the size
593 \begin_layout Subsection
594 tdb_traverse Makes Impossible Guarantees
597 \begin_layout Standard
598 tdb_traverse (and tdb_firstkey/tdb_nextkey) predate transactions, and it
599 was thought that it was important to guarantee that all records which exist
600 at the start and end of the traversal would be included, and no record
601 would be included twice.
604 \begin_layout Standard
605 This adds complexity (see
606 \begin_inset CommandInset ref
608 reference "Reliable-Traversal-Adds"
612 ) and does not work anyway for records which are altered (in particular,
613 those which are expanded may be effectively deleted and re-added behind
617 \begin_layout Subsubsection
618 \begin_inset CommandInset label
620 name "traverse-Proposed-Solution"
627 \begin_layout Standard
628 Abandon the guarantee.
629 You will see every record if no changes occur during your traversal, otherwise
630 you will see some subset.
631 You can prevent changes by using a transaction or the locking API.
634 \begin_layout Subsection
635 Nesting of Transactions Is Fraught
638 \begin_layout Standard
639 TDB has alternated between allowing nested transactions and not allowing
641 Various paths in the Samba codebase assume that transactions will nest,
642 and in a sense they can: the operation is only committed to disk when the
643 outer transaction is committed.
644 There are two problems, however:
647 \begin_layout Enumerate
648 Canceling the inner transaction will cause the outer transaction commit
649 to fail, and will not undo any operations since the inner transaction began.
650 This problem is soluble with some additional internal code.
653 \begin_layout Enumerate
654 An inner transaction commit can be cancelled by the outer transaction.
655 This is desirable in the way which Samba's database initialization code
656 uses transactions, but could be a surprise to any users expecting a successful
657 transaction commit to expose changes to others.
660 \begin_layout Standard
661 The current solution is to specify the behavior at tdb_open(), with the
662 default currently that nested transactions are allowed.
663 This flag can also be changed at runtime.
666 \begin_layout Subsubsection
670 \begin_layout Standard
671 Given the usage patterns, it seems that the
672 \begin_inset Quotes eld
676 \begin_inset Quotes erd
679 behavior of disallowing nested transactions should become the default.
680 Additionally, it seems the outer transaction is the only code which knows
681 whether inner transactions should be allowed, so a flag to indicate this
682 could be added to tdb_transaction_start.
683 However, this behavior can be simulated with a wrapper which uses tdb_add_flags
684 () and tdb_remove_flags(), so the API should not be expanded for this relatively
688 \begin_layout Subsection
689 Incorrect Hash Function is Not Detected
692 \begin_layout Standard
693 tdb_open_ex() allows the calling code to specify a different hash function
694 to use, but does not check that all other processes accessing this tdb
695 are using the same hash function.
696 The result is that records are missing from tdb_fetch().
699 \begin_layout Subsubsection
703 \begin_layout Standard
704 The header should contain an example hash result (eg.
705 the hash of 0xdeadbeef), and tdb_open_ex() should check that the given
706 hash function produces the same answer, or fail the tdb_open call.
709 \begin_layout Subsection
710 tdb_set_max_dead/TDB_VOLATILE Expose Implementation
713 \begin_layout Standard
714 In response to scalability issues with the free list (
715 \begin_inset CommandInset ref
717 reference "TDB-Freelist-Is"
721 ) two API workarounds have been incorporated in TDB: tdb_set_max_dead()
722 and the TDB_VOLATILE flag to tdb_open.
723 The latter actually calls the former with an argument of
724 \begin_inset Quotes eld
728 \begin_inset Quotes erd
734 \begin_layout Standard
735 This code allows deleted records to accumulate without putting them in the
737 On delete we iterate through each chain and free them in a batch if there
738 are more than max_dead entries.
739 These are never otherwise recycled except as a side-effect of a tdb_repack.
742 \begin_layout Subsubsection
746 \begin_layout Standard
747 With the scalability problems of the freelist solved, this API can be removed.
748 The TDB_VOLATILE flag may still be useful as a hint that store and delete
749 of records will be at least as common as fetch in order to allow some internal
750 tuning, but initially will become a no-op.
753 \begin_layout Subsection
754 \begin_inset CommandInset label
756 name "TDB-Files-Cannot"
760 TDB Files Cannot Be Opened Multiple Times In The Same Process
763 \begin_layout Standard
764 No process can open the same TDB twice; we check and disallow it.
765 This is an unfortunate side-effect of fcntl locks, which operate on a per-file
766 rather than per-file-descriptor basis, and do not nest.
767 Thus, closing any file descriptor on a file clears all the locks obtained
768 by this process, even if they were placed using a different file descriptor!
771 \begin_layout Standard
772 Note that even if this were solved, deadlock could occur if operations were
773 nested: this is a more manageable programming error in most cases.
776 \begin_layout Subsubsection
780 \begin_layout Standard
781 We could lobby POSIX to fix the perverse rules, or at least lobby Linux
782 to violate them so that the most common implementation does not have this
784 This would be a generally good idea for other fcntl lock users.
787 \begin_layout Standard
788 Samba uses a wrapper which hands out the same tdb_context to multiple callers
789 if this happens, and does simple reference counting.
790 We should do this inside the tdb library, which already emulates lock nesting
791 internally; it would need to recognize when deadlock occurs within a single
793 This would create a new failure mode for tdb operations (while we currently
794 handle locking failures, they are impossible in normal use and a process
795 encountering them can do little but give up).
798 \begin_layout Standard
799 I do not see benefit in an additional tdb_open flag to indicate whether
800 re-opening is allowed, as though there may be some benefit to adding a
801 call to detect when a tdb_context is shared, to allow other to create such
805 \begin_layout Subsection
806 TDB API Is Not POSIX Thread-safe
809 \begin_layout Standard
810 The TDB API uses an error code which can be queried after an operation to
811 determine what went wrong.
812 This programming model does not work with threads, unless specific additional
813 guarantees are given by the implementation.
814 In addition, even otherwise-independent threads cannot open the same TDB
816 \begin_inset CommandInset ref
818 reference "TDB-Files-Cannot"
825 \begin_layout Subsubsection
829 \begin_layout Standard
830 Reachitecting the API to include a tdb_errcode pointer would be a great
831 deal of churn; we are better to guarantee that the tdb_errcode is per-thread
832 so the current programming model can be maintained.
835 \begin_layout Standard
836 This requires dynamic per-thread allocations, which is awkward with POSIX
837 threads (pthread_key_create space is limited and we cannot simply allocate
838 a key for every TDB).
841 \begin_layout Standard
842 Internal locking is required to make sure that fcntl locks do not overlap
843 between threads, and also that the global list of tdbs is maintained.
846 \begin_layout Standard
847 The aim is that building tdb with -DTDB_PTHREAD will result in a pthread-safe
848 version of the library, and otherwise no overhead will exist.
850 \change_inserted 0 1284016998
851 Alternatively, a hooking mechanism similar to that proposed for
852 \begin_inset CommandInset ref
854 reference "Proposed-Solution-locking-hook"
858 could be used to enable pthread locking at runtime.
863 \begin_layout Subsection
864 *_nonblock Functions And *_mark Functions Expose Implementation
867 \begin_layout Standard
872 \begin_layout Plain Layout
873 Clustered TDB, see http://ctdb.samba.org
878 wishes to operate on TDB in a non-blocking manner.
879 This is currently done as follows:
882 \begin_layout Enumerate
883 Call the _nonblock variant of an API function (eg.
884 tdb_lockall_nonblock).
888 \begin_layout Enumerate
889 Fork a child process, and wait for it to call the normal variant (eg.
893 \begin_layout Enumerate
894 If the child succeeds, call the _mark variant to indicate we already have
899 \begin_layout Enumerate
900 Upon completion, tell the child to release the locks (eg.
904 \begin_layout Enumerate
905 Indicate to tdb that it should consider the locks removed (eg.
909 \begin_layout Standard
910 There are several issues with this approach.
911 Firstly, adding two new variants of each function clutters the API for
912 an obscure use, and so not all functions have three variants.
913 Secondly, it assumes that all paths of the functions ask for the same locks,
914 otherwise the parent process will have to get a lock which the child doesn't
915 have under some circumstances.
916 I don't believe this is currently the case, but it constrains the implementatio
921 \begin_layout Subsubsection
922 \begin_inset CommandInset label
924 name "Proposed-Solution-locking-hook"
931 \begin_layout Standard
932 Implement a hook for locking methods, so that the caller can control the
933 calls to create and remove fcntl locks.
934 In this scenario, ctdbd would operate as follows:
937 \begin_layout Enumerate
938 Call the normal API function, eg tdb_lockall().
941 \begin_layout Enumerate
942 When the lock callback comes in, check if the child has the lock.
943 Initially, this is always false.
945 Otherwise, try to obtain it in non-blocking mode.
946 If that fails, return EWOULDBLOCK.
949 \begin_layout Enumerate
950 Release locks in the unlock callback as normal.
953 \begin_layout Enumerate
954 If tdb_lockall() fails, see if we recorded a lock failure; if so, call the
955 child to repeat the operation.
958 \begin_layout Enumerate
959 The child records what locks it obtains, and returns that information to
963 \begin_layout Enumerate
964 When the child has succeeded, goto 1.
967 \begin_layout Standard
968 This is flexible enough to handle any potential locking scenario, even when
969 lock requirements change.
970 It can be optimized so that the parent does not release locks, just tells
971 the child which locks it doesn't need to obtain.
974 \begin_layout Standard
975 It also keeps the complexity out of the API, and in ctdbd where it is needed.
978 \begin_layout Subsection
979 tdb_chainlock Functions Expose Implementation
982 \begin_layout Standard
983 tdb_chainlock locks some number of records, including the record indicated
985 This gave atomicity guarantees; no-one can start a transaction, alter,
986 read or delete that key while the lock is held.
989 \begin_layout Standard
990 It also makes the same guarantee for any other key in the chain, which is
991 an internal implementation detail and potentially a cause for deadlock.
994 \begin_layout Subsubsection
998 \begin_layout Standard
1000 It would be nice to have an explicit single entry lock which effected no
1002 Unfortunately, this won't work for an entry which doesn't exist.
1003 Thus while chainlock may be implemented more efficiently for the existing
1004 case, it will still have overlap issues with the non-existing case.
1005 So it is best to keep the current (lack of) guarantee about which records
1006 will be effected to avoid constraining our implementation.
1009 \begin_layout Subsection
1010 Signal Handling is Not Race-Free
1013 \begin_layout Standard
1014 The tdb_setalarm_sigptr() call allows the caller's signal handler to indicate
1015 that the tdb locking code should return with a failure, rather than trying
1016 again when a signal is received (and errno == EAGAIN).
1017 This is usually used to implement timeouts.
1020 \begin_layout Standard
1021 Unfortunately, this does not work in the case where the signal is received
1022 before the tdb code enters the fcntl() call to place the lock: the code
1023 will sleep within the fcntl() code, unaware that the signal wants it to
1025 In the case of long timeouts, this does not happen in practice.
1028 \begin_layout Subsubsection
1032 \begin_layout Standard
1033 The locking hooks proposed in
1034 \begin_inset CommandInset ref
1036 reference "Proposed-Solution-locking-hook"
1040 would allow the user to decide on whether to fail the lock acquisition
1042 This allows the caller to choose their own compromise: they could narrow
1043 the race by checking immediately before the fcntl call.
1047 \begin_layout Plain Layout
1048 It may be possible to make this race-free in some implementations by having
1049 the signal handler alter the struct flock to make it invalid.
1050 This will cause the fcntl() lock call to fail with EINVAL if the signal
1051 occurs before the kernel is entered, otherwise EAGAIN.
1059 \begin_layout Subsection
1060 The API Uses Gratuitous Typedefs, Capitals
1063 \begin_layout Standard
1064 typedefs are useful for providing source compatibility when types can differ
1065 across implementations, or arguably in the case of function pointer definitions
1066 which are hard for humans to parse.
1067 Otherwise it is simply obfuscation and pollutes the namespace.
1070 \begin_layout Standard
1071 Capitalization is usually reserved for compile-time constants and macros.
1074 \begin_layout Description
1075 TDB_CONTEXT There is no reason to use this over 'struct tdb_context'; the
1076 definition isn't visible to the API user anyway.
1079 \begin_layout Description
1080 TDB_DATA There is no reason to use this over struct TDB_DATA; the struct
1081 needs to be understood by the API user.
1084 \begin_layout Description
1086 \begin_inset space ~
1089 TDB_DATA This would normally be called 'struct tdb_data'.
1092 \begin_layout Description
1094 \begin_inset space ~
1097 TDB_ERROR Similarly, this would normally be enum tdb_error.
1100 \begin_layout Subsubsection
1104 \begin_layout Standard
1106 Introducing lower case variants would please pedants like myself, but if
1107 it were done the existing ones should be kept.
1108 There is little point forcing a purely cosmetic change upon tdb users.
1111 \begin_layout Subsection
1112 \begin_inset CommandInset label
1114 name "tdb_log_func-Doesnt-Take"
1118 tdb_log_func Doesn't Take The Private Pointer
1121 \begin_layout Standard
1122 For API compatibility reasons, the logging function needs to call tdb_get_loggin
1123 g_private() to retrieve the pointer registered by the tdb_open_ex for logging.
1126 \begin_layout Subsubsection
1130 \begin_layout Standard
1131 It should simply take an extra argument, since we are prepared to break
1135 \begin_layout Subsection
1136 Various Callback Functions Are Not Typesafe
1139 \begin_layout Standard
1140 The callback functions in tdb_set_logging_function (after
1141 \begin_inset CommandInset ref
1143 reference "tdb_log_func-Doesnt-Take"
1147 is resolved), tdb_parse_record, tdb_traverse, tdb_traverse_read and tdb_check
1148 all take void * and must internally convert it to the argument type they
1152 \begin_layout Standard
1153 If this type changes, the compiler will not produce warnings on the callers,
1154 since it only sees void *.
1157 \begin_layout Subsubsection
1161 \begin_layout Standard
1162 With careful use of macros, we can create callback functions which give
1163 a warning when used on gcc and the types of the callback and its private
1165 Unsupported compilers will not give a warning, which is no worse than now.
1166 In addition, the callbacks become clearer, as they need not use void *
1167 for their parameter.
1170 \begin_layout Standard
1171 See CCAN's typesafe_cb module at http://ccan.ozlabs.org/info/typesafe_cb.html
1174 \begin_layout Subsection
1175 TDB_CLEAR_IF_FIRST Must Be Specified On All Opens, tdb_reopen_all Problematic
1178 \begin_layout Standard
1179 The TDB_CLEAR_IF_FIRST flag to tdb_open indicates that the TDB file should
1180 be cleared if the caller discovers it is the only process with the TDB
1182 However, if any caller does not specify TDB_CLEAR_IF_FIRST it will not
1183 be detected, so will have the TDB erased underneath them (usually resulting
1187 \begin_layout Standard
1188 There is a similar issue on fork(); if the parent exits (or otherwise closes
1189 the tdb) before the child calls tdb_reopen_all() to establish the lock
1190 used to indicate the TDB is opened by someone, a TDB_CLEAR_IF_FIRST opener
1191 at that moment will believe it alone has opened the TDB and will erase
1195 \begin_layout Subsubsection
1199 \begin_layout Standard
1200 Remove TDB_CLEAR_IF_FIRST.
1201 Other workarounds are possible, but see
1202 \begin_inset CommandInset ref
1204 reference "TDB_CLEAR_IF_FIRST-Imposes-Performance"
1209 \change_inserted 0 1284015637
1213 \begin_layout Subsection
1215 \change_inserted 0 1284015716
1216 Extending The Header Is Difficult
1219 \begin_layout Standard
1221 \change_inserted 0 1284015906
1222 We have reserved (zeroed) words in the TDB header, which can be used for
1224 If the future features are compulsory, the version number must be updated
1225 to prevent old code from accessing the database.
1226 But if the future feature is optional, we have no way of telling if older
1227 code is accessing the database or not.
1230 \begin_layout Subsubsection
1232 \change_inserted 0 1284015637
1236 \begin_layout Standard
1238 \change_inserted 0 1284016114
1239 The header should contain a
1240 \begin_inset Quotes eld
1244 \begin_inset Quotes erd
1248 This is divided into two 32-bit parts:
1251 \begin_layout Enumerate
1253 \change_inserted 0 1284016149
1254 The lower part reflects the format variant understood by code accessing
1258 \begin_layout Enumerate
1260 \change_inserted 0 1284016639
1261 The upper part reflects the format variant you must understand to write
1262 to the database (otherwise you can only open for reading).
1265 \begin_layout Standard
1267 \change_inserted 0 1284016821
1268 The latter field can only be written at creation time, the former should
1269 be written under the OPEN_LOCK when opening the database for writing, if
1270 the variant of the code is lower than the current lowest variant.
1273 \begin_layout Standard
1275 \change_inserted 0 1284016803
1276 This should allow backwards-compatible features to be added, and detection
1277 if older code (which doesn't understand the feature) writes to the database.
1278 \change_deleted 0 1284016101
1282 \begin_layout Subsection
1284 \change_inserted 0 1284015634
1285 Record Headers Are Not Expandible
1288 \begin_layout Standard
1290 \change_inserted 0 1284015634
1291 If we later want to add (say) checksums on keys and data, it would require
1292 another format change, which we'd like to avoid.
1295 \begin_layout Subsubsection
1297 \change_inserted 0 1284015634
1301 \begin_layout Standard
1303 \change_inserted 0 1284422552
1304 We often have extra padding at the tail of a record.
1305 If we ensure that the first byte (if any) of this padding is zero, we will
1306 have a way for future changes to detect code which doesn't understand a
1307 new format: the new code would write (say) a 1 at the tail, and thus if
1308 there is no tail or the first byte is 0, we would know the extension is
1309 not present on that record.
1312 \begin_layout Subsection
1314 \change_inserted 0 1284422568
1315 TDB Does Not Use Talloc
1318 \begin_layout Standard
1320 \change_inserted 0 1284422646
1321 Many users of TDB (particularly Samba) use the talloc allocator, and thus
1322 have to wrap TDB in a talloc context to use it conveniently.
1325 \begin_layout Subsubsection
1327 \change_inserted 0 1284422656
1331 \begin_layout Standard
1333 \change_inserted 0 1284423065
1334 The allocation within TDB is not complicated enough to justify the use of
1335 talloc, and I am reluctant to force another (excellent) library on TDB
1337 Nonetheless a compromise is possible.
1339 \begin_inset CommandInset ref
1341 reference "attributes"
1345 ) can be added later to tdb_open() to provide an alternate allocation mechanism,
1346 specifically for talloc but usable by any other allocator (which would
1348 \begin_inset Quotes eld
1352 \begin_inset Quotes erd
1358 \begin_layout Standard
1360 \change_inserted 0 1284423042
1361 This would form a talloc heirarchy as expected, but the caller would still
1362 have to attach a destructor to the tdb context returned from tdb_open to
1364 All TDB_DATA fields would be children of the tdb_context, and the caller
1365 would still have to manage them (using talloc_free() or talloc_steal()).
1370 \begin_layout Section
1371 Performance And Scalability Issues
1374 \begin_layout Subsection
1375 \begin_inset CommandInset label
1377 name "TDB_CLEAR_IF_FIRST-Imposes-Performance"
1381 TDB_CLEAR_IF_FIRST Imposes Performance Penalty
1384 \begin_layout Standard
1385 When TDB_CLEAR_IF_FIRST is specified, a 1-byte read lock is placed at offset
1388 While these locks never conflict in normal tdb usage, they do add substantial
1389 overhead for most fcntl lock implementations when the kernel scans to detect
1390 if a lock conflict exists.
1391 This is often a single linked list, making the time to acquire and release
1392 a fcntl lock O(N) where N is the number of processes with the TDB open,
1393 not the number actually doing work.
1396 \begin_layout Standard
1397 In a Samba server it is common to have huge numbers of clients sitting idle,
1398 and thus they have weaned themselves off the TDB_CLEAR_IF_FIRST flag.
1402 \begin_layout Plain Layout
1403 There is a flag to tdb_reopen_all() which is used for this optimization:
1404 if the parent process will outlive the child, the child does not need the
1406 This is a workaround for this very performance issue.
1414 \begin_layout Subsubsection
1418 \begin_layout Standard
1420 It was a neat idea, but even trivial servers tend to know when they are
1421 initializing for the first time and can simply unlink the old tdb at that
1425 \begin_layout Subsection
1426 TDB Files Have a 4G Limit
1429 \begin_layout Standard
1430 This seems to be becoming an issue (so much for
1431 \begin_inset Quotes eld
1435 \begin_inset Quotes erd
1438 !), particularly for ldb.
1441 \begin_layout Subsubsection
1445 \begin_layout Standard
1446 A new, incompatible TDB format which uses 64 bit offsets internally rather
1448 For simplicity of endian conversion (which TDB does on the fly if required),
1449 all values will be 64 bit on disk.
1450 In practice, some upper bits may be used for other purposes, but at least
1451 56 bits will be available for file offsets.
1454 \begin_layout Standard
1455 tdb_open() will automatically detect the old version, and even create them
1456 if TDB_VERSION6 is specified to tdb_open.
1459 \begin_layout Standard
1460 32 bit processes will still be able to access TDBs larger than 4G (assuming
1461 that their off_t allows them to seek to 64 bits), they will gracefully
1462 fall back as they fail to mmap.
1463 This can happen already with large TDBs.
1466 \begin_layout Standard
1467 Old versions of tdb will fail to open the new TDB files (since 28 August
1468 2009, commit 398d0c29290: prior to that any unrecognized file format would
1469 be erased and initialized as a fresh tdb!)
1472 \begin_layout Subsection
1473 TDB Records Have a 4G Limit
1476 \begin_layout Standard
1477 This has not been a reported problem, and the API uses size_t which can
1478 be 64 bit on 64 bit platforms.
1479 However, other limits may have made such an issue moot.
1482 \begin_layout Subsubsection
1486 \begin_layout Standard
1487 Record sizes will be 64 bit, with an error returned on 32 bit platforms
1488 which try to access such records (the current implementation would return
1489 TDB_ERR_OOM in a similar case).
1490 It seems unlikely that 32 bit keys will be a limitation, so the implementation
1491 may not support this (see
1492 \begin_inset CommandInset ref
1494 reference "sub:Records-Incur-A"
1501 \begin_layout Subsection
1502 Hash Size Is Determined At TDB Creation Time
1505 \begin_layout Standard
1506 TDB contains a number of hash chains in the header; the number is specified
1507 at creation time, and defaults to 131.
1508 This is such a bottleneck on large databases (as each hash chain gets quite
1509 long), that LDB uses 10,000 for this hash.
1510 In general it is impossible to know what the 'right' answer is at database
1514 \begin_layout Subsubsection
1516 \change_inserted 0 1283336713
1517 \begin_inset CommandInset label
1519 name "sub:Hash-Size-Solution"
1528 \begin_layout Standard
1529 After comprehensive performance testing on various scalable hash variants
1533 \begin_layout Plain Layout
1534 http://rusty.ozlabs.org/?p=89 and http://rusty.ozlabs.org/?p=94 This was annoying
1535 because I was previously convinced that an expanding tree of hashes would
1536 be very close to optimal.
1541 , it became clear that it is hard to beat a straight linear hash table which
1542 doubles in size when it reaches saturation.
1544 \change_deleted 0 1283307675
1545 There are three details which become important:
1548 \begin_layout Enumerate
1550 \change_deleted 0 1283307675
1551 On encountering a full bucket, we use the next bucket.
1554 \begin_layout Enumerate
1556 \change_deleted 0 1283307675
1557 Extra hash bits are stored with the offset, to reduce comparisons.
1560 \begin_layout Enumerate
1562 \change_deleted 0 1283307675
1563 A marker entry is used on deleting an entry.
1566 \begin_layout Standard
1568 \change_deleted 0 1283307675
1569 The doubling of the table must be done under a transaction; we will not
1570 reduce it on deletion, so it will be an unusual case.
1571 It will either be placed at the head (other entries will be moved out the
1572 way so we can expand).
1573 We could have a pointer in the header to the current hashtable location,
1574 but that pointer would have to be read frequently to check for hashtable
1578 \begin_layout Standard
1580 \change_deleted 0 1283307675
1581 The locking for this is slightly more complex than the chained case; we
1582 currently have one lock per bucket, and that means we would need to expand
1583 the lock if we overflow to the next bucket.
1584 The frequency of such collisions will effect our locking heuristics: we
1585 can always lock more buckets than we need.
1588 \begin_layout Standard
1590 \change_deleted 0 1283307675
1591 One possible optimization is to only re-check the hash size on an insert
1594 \change_inserted 0 1283307770
1595 Unfortunately, altering the hash table introduces serious locking complications
1596 : the entire hash table needs to be locked to enlarge the hash table, and
1597 others might be holding locks.
1598 Particularly insidious are insertions done under tdb_chainlock.
1601 \begin_layout Standard
1603 \change_inserted 0 1283336187
1604 Thus an expanding layered hash will be used: an array of hash groups, with
1605 each hash group exploding into pointers to lower hash groups once it fills,
1606 turning into a hash tree.
1607 This has implications for locking: we must lock the entire group in case
1608 we need to expand it, yet we don't know how deep the tree is at that point.
1611 \begin_layout Standard
1613 \change_inserted 0 1283336586
1614 Note that bits from the hash table entries should be stolen to hold more
1615 hash bits to reduce the penalty of collisions.
1616 We can use the otherwise-unused lower 3 bits.
1617 If we limit the size of the database to 64 exabytes, we can use the top
1618 8 bits of the hash entry as well.
1619 These 11 bits would reduce false positives down to 1 in 2000 which is more
1620 than we need: we can use one of the bits to indicate that the extra hash
1622 This means we can choose not to re-hash all entries when we expand a hash
1623 group; simply use the next bits we need and mark them invalid.
1628 \begin_layout Subsection
1629 \begin_inset CommandInset label
1631 name "TDB-Freelist-Is"
1635 TDB Freelist Is Highly Contended
1638 \begin_layout Standard
1639 TDB uses a single linked list for the free list.
1640 Allocation occurs as follows, using heuristics which have evolved over
1644 \begin_layout Enumerate
1645 Get the free list lock for this whole operation.
1648 \begin_layout Enumerate
1649 Multiply length by 1.25, so we always over-allocate by 25%.
1652 \begin_layout Enumerate
1653 Set the slack multiplier to 1.
1656 \begin_layout Enumerate
1657 Examine the current freelist entry: if it is > length but < the current
1658 best case, remember it as the best case.
1661 \begin_layout Enumerate
1662 Multiply the slack multiplier by 1.05.
1665 \begin_layout Enumerate
1666 If our best fit so far is less than length * slack multiplier, return it.
1667 The slack will be turned into a new free record if it's large enough.
1670 \begin_layout Enumerate
1671 Otherwise, go onto the next freelist entry.
1674 \begin_layout Standard
1675 Deleting a record occurs as follows:
1678 \begin_layout Enumerate
1679 Lock the hash chain for this whole operation.
1682 \begin_layout Enumerate
1683 Walk the chain to find the record, keeping the prev pointer offset.
1686 \begin_layout Enumerate
1687 If max_dead is non-zero:
1691 \begin_layout Enumerate
1692 Walk the hash chain again and count the dead records.
1695 \begin_layout Enumerate
1696 If it's more than max_dead, bulk free all the dead ones (similar to steps
1697 4 and below, but the lock is only obtained once).
1700 \begin_layout Enumerate
1701 Simply mark this record as dead and return.
1706 \begin_layout Enumerate
1707 Get the free list lock for the remainder of this operation.
1710 \begin_layout Enumerate
1711 \begin_inset CommandInset label
1713 name "right-merging"
1717 Examine the following block to see if it is free; if so, enlarge the current
1718 block and remove that block from the free list.
1719 This was disabled, as removal from the free list was O(entries-in-free-list).
1722 \begin_layout Enumerate
1723 Examine the preceeding block to see if it is free: for this reason, each
1724 block has a 32-bit tailer which indicates its length.
1725 If it is free, expand it to cover our new block and return.
1728 \begin_layout Enumerate
1729 Otherwise, prepend ourselves to the free list.
1732 \begin_layout Standard
1733 Disabling right-merging (step
1734 \begin_inset CommandInset ref
1736 reference "right-merging"
1740 ) causes fragmentation; the other heuristics proved insufficient to address
1741 this, so the final answer to this was that when we expand the TDB file
1742 inside a transaction commit, we repack the entire tdb.
1745 \begin_layout Standard
1746 The single list lock limits our allocation rate; due to the other issues
1747 this is not currently seen as a bottleneck.
1750 \begin_layout Subsubsection
1752 \change_deleted 0 1283336858
1756 \begin_layout Standard
1757 The first step is to remove all the current heuristics, as they obviously
1758 interact, then examine them once the lock contention is addressed.
1761 \begin_layout Standard
1762 The free list must be split to reduce contention.
1763 Assuming perfect free merging, we can at most have 1 free list entry for
1765 This implies that the number of free lists is related to the size of the
1766 hash table, but as it is rare to walk a large number of free list entries
1767 we can use far fewer, say 1/32 of the number of hash buckets.
1768 \change_inserted 0 1283336910
1772 \begin_layout Standard
1774 \change_inserted 0 1283337052
1775 It seems tempting to try to reuse the hash implementation which we use for
1776 records here, but we have two ways of searching for free entries: for allocatio
1777 n we search by size (and possibly zone) which produces too many clashes
1778 for our hash table to handle well, and for coalescing we search by address.
1779 Thus an array of doubly-linked free lists seems preferable.
1784 \begin_layout Standard
1785 There are various benefits in using per-size free lists (see
1786 \begin_inset CommandInset ref
1788 reference "sub:TDB-Becomes-Fragmented"
1792 ) but it's not clear this would reduce contention in the common case where
1793 all processes are allocating/freeing the same size.
1794 Thus we almost certainly need to divide in other ways: the most obvious
1795 is to divide the file into zones, and using a free list (or set of free
1797 This approximates address ordering.
1800 \begin_layout Standard
1801 Note that this means we need to split the free lists when we expand the
1802 file; this is probably acceptable when we double the hash table size, since
1803 that is such an expensive operation already.
1804 In the case of increasing the file size, there is an optimization we can
1805 use: if we use M in the formula above as the file size rounded up to the
1806 next power of 2, we only need reshuffle free lists when the file size crosses
1807 a power of 2 boundary,
1811 reshuffling the free lists is trivial: we simply merge every consecutive
1815 \begin_layout Standard
1816 The basic algorithm is as follows.
1820 \begin_layout Enumerate
1821 Identify the correct zone.
1824 \begin_layout Enumerate
1825 Lock the corresponding list.
1828 \begin_layout Enumerate
1829 Re-check the zone (we didn't have a lock, sizes could have changed): relock
1833 \begin_layout Enumerate
1834 Place the freed entry in the list for that zone.
1837 \begin_layout Standard
1838 Allocation is a little more complicated, as we perform delayed coalescing
1842 \begin_layout Enumerate
1843 Pick a zone either the zone we last freed into, or based on a
1844 \begin_inset Quotes eld
1848 \begin_inset Quotes erd
1854 \begin_layout Enumerate
1855 Lock the corresponding list.
1858 \begin_layout Enumerate
1859 Re-check the zone: relock if necessary.
1862 \begin_layout Enumerate
1863 If the top entry is -large enough, remove it from the list and return it.
1866 \begin_layout Enumerate
1867 Otherwise, coalesce entries in the list.If there was no entry large enough,
1868 unlock the list and try the next zone.
1871 \begin_layout Enumerate
1872 If no zone satisfies, expand the file.
1875 \begin_layout Standard
1876 This optimizes rapid insert/delete of free list entries by not coalescing
1878 First-fit address ordering ordering seems to be fairly good for keeping
1879 fragmentation low (see
1880 \begin_inset CommandInset ref
1882 reference "sub:TDB-Becomes-Fragmented"
1887 Note that address ordering does not need a tailer to coalesce, though if
1888 we needed one we could have one cheaply: see
1889 \begin_inset CommandInset ref
1891 reference "sub:Records-Incur-A"
1899 \begin_layout Standard
1900 I anticipate that the number of entries in each free zone would be small,
1901 but it might be worth using one free entry to hold pointers to the others
1902 for cache efficiency.
1903 \change_inserted 0 1283309850
1907 \begin_layout Standard
1909 \change_inserted 0 1283337216
1910 \begin_inset CommandInset label
1912 name "freelist-in-zone"
1916 If we want to avoid locking complexity (enlarging the free lists when we
1917 enlarge the file) we could place the array of free lists at the beginning
1919 This means existing array lists never move, but means that a record cannot
1920 be larger than a zone.
1921 That in turn implies that zones should be variable sized (say, power of
1922 2), which makes the question
1923 \begin_inset Quotes eld
1926 what zone is this record in?
1927 \begin_inset Quotes erd
1931 \begin_inset Quotes eld
1935 \begin_inset Quotes erd
1938 , but that's less common).
1939 It could be done with as few as 4 bits from the record header.
1943 \begin_layout Plain Layout
1945 \change_inserted 0 1284424151
1947 \begin_inset Formula $2^{16+N*3}$
1950 means 0 gives a minimal 65536-byte zone, 15 gives the maximal
1951 \begin_inset Formula $2^{61}$
1955 Zones range in factor of 8 steps.
1956 Given the zone size for the zone the current record is in, we can determine
1957 the start of the zone.
1969 \begin_layout Subsection
1970 \begin_inset CommandInset label
1972 name "sub:TDB-Becomes-Fragmented"
1976 TDB Becomes Fragmented
1979 \begin_layout Standard
1980 Much of this is a result of allocation strategy
1984 \begin_layout Plain Layout
1985 The Memory Fragmentation Problem: Solved? Johnstone & Wilson 1995 ftp://ftp.cs.ute
1986 xas.edu/pub/garbage/malloc/ismm98.ps
1991 and deliberate hobbling of coalescing; internal fragmentation (aka overallocati
1992 on) is deliberately set at 25%, and external fragmentation is only cured
1993 by the decision to repack the entire db when a transaction commit needs
1994 to enlarge the file.
1997 \begin_layout Subsubsection
2001 \begin_layout Standard
2002 The 25% overhead on allocation works in practice for ldb because indexes
2003 tend to expand by one record at a time.
2004 This internal fragmentation can be resolved by having an
2005 \begin_inset Quotes eld
2009 \begin_inset Quotes erd
2012 bit in the header to note entries that have previously expanded, and allocating
2013 more space for them.
2016 \begin_layout Standard
2017 There are is a spectrum of possible solutions for external fragmentation:
2018 one is to use a fragmentation-avoiding allocation strategy such as best-fit
2019 address-order allocator.
2020 The other end of the spectrum would be to use a bump allocator (very fast
2021 and simple) and simply repack the file when we reach the end.
2024 \begin_layout Standard
2025 There are three problems with efficient fragmentation-avoiding allocators:
2026 they are non-trivial, they tend to use a single free list for each size,
2027 and there's no evidence that tdb allocation patterns will match those recorded
2028 for general allocators (though it seems likely).
2031 \begin_layout Standard
2032 Thus we don't spend too much effort on external fragmentation; we will be
2033 no worse than the current code if we need to repack on occasion.
2034 More effort is spent on reducing freelist contention, and reducing overhead.
2037 \begin_layout Subsection
2038 \begin_inset CommandInset label
2040 name "sub:Records-Incur-A"
2044 Records Incur A 28-Byte Overhead
2047 \begin_layout Standard
2048 Each TDB record has a header as follows:
2051 \begin_layout LyX-Code
2055 \begin_layout LyX-Code
2056 tdb_off_t next; /* offset of the next record in the list */
2059 \begin_layout LyX-Code
2060 tdb_len_t rec_len; /* total byte length of record */
2063 \begin_layout LyX-Code
2064 tdb_len_t key_len; /* byte length of key */
2067 \begin_layout LyX-Code
2068 tdb_len_t data_len; /* byte length of data */
2071 \begin_layout LyX-Code
2072 uint32_t full_hash; /* the full 32 bit hash of the key */
2075 \begin_layout LyX-Code
2076 uint32_t magic; /* try to catch errors */
2079 \begin_layout LyX-Code
2080 /* the following union is implied:
2083 \begin_layout LyX-Code
2087 \begin_layout LyX-Code
2088 char record[rec_len];
2091 \begin_layout LyX-Code
2095 \begin_layout LyX-Code
2099 \begin_layout LyX-Code
2100 char data[data_len];
2103 \begin_layout LyX-Code
2107 \begin_layout LyX-Code
2108 uint32_t totalsize; (tailer)
2111 \begin_layout LyX-Code
2115 \begin_layout LyX-Code
2119 \begin_layout LyX-Code
2123 \begin_layout Standard
2124 Naively, this would double to a 56-byte overhead on a 64 bit implementation.
2127 \begin_layout Subsubsection
2131 \begin_layout Standard
2132 We can use various techniques to reduce this for an allocated block:
2135 \begin_layout Enumerate
2136 The 'next' pointer is not required, as we are using a flat hash table.
2139 \begin_layout Enumerate
2140 'rec_len' can instead be expressed as an addition to key_len and data_len
2141 (it accounts for wasted or overallocated length in the record).
2142 Since the record length is always a multiple of 8, we can conveniently
2143 fit it in 32 bits (representing up to 35 bits).
2146 \begin_layout Enumerate
2147 'key_len' and 'data_len' can be reduced.
2148 I'm unwilling to restrict 'data_len' to 32 bits, but instead we can combine
2149 the two into one 64-bit field and using a 5 bit value which indicates at
2150 what bit to divide the two.
2151 Keys are unlikely to scale as fast as data, so I'm assuming a maximum key
2155 \begin_layout Enumerate
2156 'full_hash' is used to avoid a memcmp on the
2157 \begin_inset Quotes eld
2161 \begin_inset Quotes erd
2164 case, but this is diminishing returns after a handful of bits (at 10 bits,
2165 it reduces 99.9% of false memcmp).
2166 As an aside, as the lower bits are already incorporated in the hash table
2167 resolution, the upper bits should be used here.
2169 \change_inserted 0 1283336739
2170 Note that it's not clear that these bits will be a win, given the extra
2171 bits in the hash table itself (see
2172 \begin_inset CommandInset ref
2174 reference "sub:Hash-Size-Solution"
2183 \begin_layout Enumerate
2184 'magic' does not need to be enlarged: it currently reflects one of 5 values
2185 (used, free, dead, recovery, and unused_recovery).
2186 It is useful for quick sanity checking however, and should not be eliminated.
2189 \begin_layout Enumerate
2190 'tailer' is only used to coalesce free blocks (so a block to the right can
2191 find the header to check if this block is free).
2192 This can be replaced by a single 'free' bit in the header of the following
2193 block (and the tailer only exists in free blocks).
2197 \begin_layout Plain Layout
2198 This technique from Thomas Standish.
2199 Data Structure Techniques.
2200 Addison-Wesley, Reading, Massachusetts, 1980.
2205 The current proposed coalescing algorithm doesn't need this, however.
2208 \begin_layout Standard
2209 This produces a 16 byte used header like this:
2212 \begin_layout LyX-Code
2213 struct tdb_used_record {
2216 \begin_layout LyX-Code
2217 uint32_t magic : 16,
2220 \begin_layout LyX-Code
2224 \begin_layout LyX-Code
2228 \begin_layout LyX-Code
2232 \begin_layout LyX-Code
2233 uint32_t extra_octets;
2236 \begin_layout LyX-Code
2237 uint64_t key_and_data_len;
2240 \begin_layout LyX-Code
2244 \begin_layout Standard
2245 And a free record like this:
2248 \begin_layout LyX-Code
2249 struct tdb_free_record {
2252 \begin_layout LyX-Code
2253 uint32_t free_magic;
2256 \begin_layout LyX-Code
2257 uint64_t total_length;
2258 \change_inserted 0 1283337133
2262 \begin_layout LyX-Code
2264 \change_inserted 0 1283337139
2265 uint64_t prev, next;
2270 \begin_layout LyX-Code
2274 \begin_layout LyX-Code
2278 \begin_layout LyX-Code
2282 \begin_layout Standard
2284 \change_inserted 0 1283337235
2285 We might want to take some bits from the used record's top_hash (and the
2286 free record which has 32 bits of padding to spare anyway) if we use variable
2289 \begin_inset CommandInset ref
2291 reference "freelist-in-zone"
2300 \begin_layout Subsection
2301 Transaction Commit Requires 4 fdatasync
2304 \begin_layout Standard
2305 The current transaction algorithm is:
2308 \begin_layout Enumerate
2309 write_recovery_data();
2312 \begin_layout Enumerate
2316 \begin_layout Enumerate
2317 write_recovery_header();
2320 \begin_layout Enumerate
2324 \begin_layout Enumerate
2325 overwrite_with_new_data();
2328 \begin_layout Enumerate
2332 \begin_layout Enumerate
2333 remove_recovery_header();
2336 \begin_layout Enumerate
2340 \begin_layout Standard
2341 On current ext3, each sync flushes all data to disk, so the next 3 syncs
2342 are relatively expensive.
2343 But this could become a performance bottleneck on other filesystems such
2347 \begin_layout Subsubsection
2351 \begin_layout Standard
2352 Neil Brown points out that this is overzealous, and only one sync is needed:
2355 \begin_layout Enumerate
2356 Bundle the recovery data, a transaction counter and a strong checksum of
2360 \begin_layout Enumerate
2361 Strong checksum that whole bundle.
2364 \begin_layout Enumerate
2365 Store the bundle in the database.
2368 \begin_layout Enumerate
2369 Overwrite the oldest of the two recovery pointers in the header (identified
2370 using the transaction counter) with the offset of this bundle.
2373 \begin_layout Enumerate
2377 \begin_layout Enumerate
2378 Write the new data to the file.
2381 \begin_layout Standard
2382 Checking for recovery means identifying the latest bundle with a valid checksum
2383 and using the new data checksum to ensure that it has been applied.
2384 This is more expensive than the current check, but need only be done at
2386 For running databases, a separate header field can be used to indicate
2387 a transaction in progress; we need only check for recovery if this is set.
2390 \begin_layout Subsection
2391 \begin_inset CommandInset label
2393 name "sub:TDB-Does-Not"
2397 TDB Does Not Have Snapshot Support
2400 \begin_layout Subsubsection
2402 \change_deleted 0 1284423472
2406 \begin_layout Standard
2408 At some point you say
2409 \begin_inset Quotes eld
2413 \begin_inset Quotes erd
2417 \change_inserted 0 1284423891
2419 \change_deleted 0 1284423891
2422 \change_inserted 0 1284423901
2424 \begin_inset CommandInset ref
2426 reference "replay-attribute"
2435 \begin_layout Standard
2436 But as a thought experiment, if we implemented transactions to only overwrite
2437 free entries (this is tricky: there must not be a header in each entry
2438 which indicates whether it is free, but use of presence in metadata elsewhere),
2439 and a pointer to the hash table, we could create an entirely new commit
2440 without destroying existing data.
2441 Then it would be easy to implement snapshots in a similar way.
2444 \begin_layout Standard
2445 This would not allow arbitrary changes to the database, such as tdb_repack
2446 does, and would require more space (since we have to preserve the current
2447 and future entries at once).
2448 If we used hash trees rather than one big hash table, we might only have
2449 to rewrite some sections of the hash, too.
2452 \begin_layout Standard
2453 We could then implement snapshots using a similar method, using multiple
2454 different hash tables/free tables.
2455 \change_inserted 0 1284423495
2459 \begin_layout Subsection
2460 Transactions Cannot Operate in Parallel
2463 \begin_layout Standard
2464 This would be useless for ldb, as it hits the index records with just about
2466 It would add significant complexity in resolving clashes, and cause the
2467 all transaction callers to write their code to loop in the case where the
2468 transactions spuriously failed.
2471 \begin_layout Subsubsection
2475 \begin_layout Standard
2477 \change_inserted 0 1284424201
2479 \begin_inset CommandInset ref
2481 reference "replay-attribute"
2488 We could solve a small part of the problem by providing read-only transactions.
2489 These would allow one write transaction to begin, but it could not commit
2490 until all r/o transactions are done.
2491 This would require a new RO_TRANSACTION_LOCK, which would be upgraded on
2495 \begin_layout Subsection
2496 Default Hash Function Is Suboptimal
2499 \begin_layout Standard
2500 The Knuth-inspired multiplicative hash used by tdb is fairly slow (especially
2501 if we expand it to 64 bits), and works best when the hash bucket size is
2502 a prime number (which also means a slow modulus).
2503 In addition, it is highly predictable which could potentially lead to a
2504 Denial of Service attack in some TDB uses.
2507 \begin_layout Subsubsection
2511 \begin_layout Standard
2512 The Jenkins lookup3 hash
2516 \begin_layout Plain Layout
2517 http://burtleburtle.net/bob/c/lookup3.c
2522 is a fast and superbly-mixing hash.
2523 It's used by the Linux kernel and almost everything else.
2524 This has the particular properties that it takes an initial seed, and produces
2525 two 32 bit hash numbers, which we can combine into a 64-bit hash.
2528 \begin_layout Standard
2529 The seed should be created at tdb-creation time from some random source,
2530 and placed in the header.
2531 This is far from foolproof, but adds a little bit of protection against
2535 \begin_layout Subsection
2536 \begin_inset CommandInset label
2538 name "Reliable-Traversal-Adds"
2542 Reliable Traversal Adds Complexity
2545 \begin_layout Standard
2546 We lock a record during traversal iteration, and try to grab that lock in
2548 If that grab on delete fails, we simply mark it deleted and continue onwards;
2549 traversal checks for this condition and does the delete when it moves off
2553 \begin_layout Standard
2554 If traversal terminates, the dead record may be left indefinitely.
2557 \begin_layout Subsubsection
2561 \begin_layout Standard
2562 Remove reliability guarantees; see
2563 \begin_inset CommandInset ref
2565 reference "traverse-Proposed-Solution"
2572 \begin_layout Subsection
2573 Fcntl Locking Adds Overhead
2576 \begin_layout Standard
2577 Placing a fcntl lock means a system call, as does removing one.
2578 This is actually one reason why transactions can be faster (everything
2579 is locked once at transaction start).
2580 In the uncontended case, this overhead can theoretically be eliminated.
2583 \begin_layout Subsubsection
2587 \begin_layout Standard
2591 \begin_layout Standard
2592 We tried this before with spinlock support, in the early days of TDB, and
2593 it didn't make much difference except in manufactured benchmarks.
2596 \begin_layout Standard
2597 We could use spinlocks (with futex kernel support under Linux), but it means
2598 that we lose automatic cleanup when a process dies with a lock.
2599 There is a method of auto-cleanup under Linux, but it's not supported by
2600 other operating systems.
2601 We could reintroduce a clear-if-first-style lock and sweep for dead futexes
2602 on open, but that wouldn't help the normal case of one concurrent opener
2604 Increasingly elaborate repair schemes could be considered, but they require
2605 an ABI change (everyone must use them) anyway, so there's no need to do
2606 this at the same time as everything else.
2609 \begin_layout Subsection
2610 Some Transactions Don't Require Durability
2613 \begin_layout Standard
2614 Volker points out that gencache uses a CLEAR_IF_FIRST tdb for normal (fast)
2615 usage, and occasionally empties the results into a transactional TDB.
2616 This kind of usage prioritizes performance over durability: as long as
2617 we are consistent, data can be lost.
2620 \begin_layout Standard
2621 This would be more neatly implemented inside tdb: a
2622 \begin_inset Quotes eld
2626 \begin_inset Quotes erd
2629 transaction commit (ie.
2630 syncless) which meant that data may be reverted on a crash.
2633 \begin_layout Subsubsection
2637 \begin_layout Standard
2641 \begin_layout Standard
2642 Unfortunately any transaction scheme which overwrites old data requires
2643 a sync before that overwrite to avoid the possibility of corruption.
2646 \begin_layout Standard
2647 It seems possible to use a scheme similar to that described in
2648 \begin_inset CommandInset ref
2650 reference "sub:TDB-Does-Not"
2654 ,where transactions are committed without overwriting existing data, and
2655 an array of top-level pointers were available in the header.
2656 If the transaction is
2657 \begin_inset Quotes eld
2661 \begin_inset Quotes erd
2664 then we would not need a sync at all: existing processes would pick up
2665 the new hash table and free list and work with that.
2668 \begin_layout Standard
2669 At some later point, a sync would allow recovery of the old data into the
2670 free lists (perhaps when the array of top-level pointers filled).
2671 On crash, tdb_open() would examine the array of top levels, and apply the
2672 transactions until it encountered an invalid checksum.
2673 \change_inserted 0 1284423555
2677 \begin_layout Subsection
2679 \change_inserted 0 1284423617
2680 Tracing Is Fragile, Replay Is External
2683 \begin_layout Standard
2685 \change_inserted 0 1284423719
2686 The current TDB has compile-time-enabled tracing code, but it often breaks
2687 as it is not enabled by default.
2688 In a similar way, the ctdb code has an external wrapper which does replay
2689 tracing so it can coordinate cluster-wide transactions.
2692 \begin_layout Subsubsection
2694 \change_inserted 0 1284423864
2696 \begin_inset CommandInset label
2698 name "replay-attribute"
2705 \begin_layout Standard
2707 \change_inserted 0 1284423850
2708 Tridge points out that an attribute can be later added to tdb_open (see
2710 \begin_inset CommandInset ref
2712 reference "attributes"
2716 ) to provide replay/trace hooks, which could become the basis for this and
2717 future parallel transactions and snapshot support.