9 date 2010.09.02.02.29.05; author rusty; state Exp;
14 date 2010.09.01.10.58.12; author rusty; state Exp;
19 date 2010.08.02.00.21.43; author rusty; state Exp;
24 date 2010.08.02.00.21.16; author rusty; state Exp;
29 date 2010.05.10.13.09.11; author rusty; state Exp;
34 date 2010.05.10.11.58.37; author rusty; state Exp;
39 date 2010.05.10.05.35.13; author rusty; state Exp;
44 date 2010.05.04.02.29.16; author rusty; state Exp;
56 @Remove bogus footnote
59 @#LyX 1.6.5 created this file. For more info see http://www.lyx.org/
64 \use_default_options true
69 \font_typewriter default
70 \font_default_family default
77 \paperfontsize default
85 \paperorientation portrait
88 \paragraph_separation indent
90 \quotes_language english
93 \paperpagestyle default
94 \tracking_changes true
96 \author "Rusty Russell,,,"
103 TDB2: A Redesigning The Trivial DataBase
107 Rusty Russell, IBM Corporation
112 \change_deleted 0 1283307542
114 \change_inserted 0 1283307544
120 \begin_layout Abstract
121 The Trivial DataBase on-disk format is 32 bits; with usage cases heading
122 towards the 4G limit, that must change.
123 This required breakage provides an opportunity to revisit TDB's other design
124 decisions and reassess them.
127 \begin_layout Section
131 \begin_layout Standard
132 The Trivial DataBase was originally written by Andrew Tridgell as a simple
133 key/data pair storage system with the same API as dbm, but allowing multiple
134 readers and writers while being small enough (< 1000 lines of C) to include
136 The simple design created in 1999 has proven surprisingly robust and performant
137 , used in Samba versions 3 and 4 as well as numerous other projects.
138 Its useful life was greatly increased by the (backwards-compatible!) addition
139 of transaction support in 2005.
142 \begin_layout Standard
143 The wider variety and greater demands of TDB-using code has lead to some
144 organic growth of the API, as well as some compromises on the implementation.
145 None of these, by themselves, are seen as show-stoppers, but the cumulative
146 effect is to a loss of elegance over the initial, simple TDB implementation.
147 Here is a table of the approximate number of lines of implementation code
148 and number of API functions at the end of each year:
151 \begin_layout Standard
153 <lyxtabular version="3" rows="12" columns="3">
155 <column alignment="center" valignment="top" width="0">
156 <column alignment="center" valignment="top" width="0">
157 <column alignment="center" valignment="top" width="0">
159 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
162 \begin_layout Plain Layout
168 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
171 \begin_layout Plain Layout
177 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
180 \begin_layout Plain Layout
181 Lines of C Code Implementation
188 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
191 \begin_layout Plain Layout
197 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
200 \begin_layout Plain Layout
206 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
209 \begin_layout Plain Layout
217 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
220 \begin_layout Plain Layout
226 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
229 \begin_layout Plain Layout
235 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
238 \begin_layout Plain Layout
246 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
249 \begin_layout Plain Layout
255 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
258 \begin_layout Plain Layout
264 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
267 \begin_layout Plain Layout
275 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
278 \begin_layout Plain Layout
284 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
287 \begin_layout Plain Layout
293 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
296 \begin_layout Plain Layout
304 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
307 \begin_layout Plain Layout
313 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
316 \begin_layout Plain Layout
322 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
325 \begin_layout Plain Layout
333 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
336 \begin_layout Plain Layout
342 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
345 \begin_layout Plain Layout
351 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
354 \begin_layout Plain Layout
362 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
365 \begin_layout Plain Layout
371 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
374 \begin_layout Plain Layout
380 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
383 \begin_layout Plain Layout
391 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
394 \begin_layout Plain Layout
400 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
403 \begin_layout Plain Layout
409 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
412 \begin_layout Plain Layout
420 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
423 \begin_layout Plain Layout
429 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
432 \begin_layout Plain Layout
438 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
441 \begin_layout Plain Layout
449 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
452 \begin_layout Plain Layout
458 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
461 \begin_layout Plain Layout
467 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
470 \begin_layout Plain Layout
478 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
481 \begin_layout Plain Layout
487 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
490 \begin_layout Plain Layout
496 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
499 \begin_layout Plain Layout
513 \begin_layout Standard
514 This review is an attempt to catalog and address all the known issues with
515 TDB and create solutions which address the problems without significantly
516 increasing complexity; all involved are far too aware of the dangers of
517 second system syndrome in rewriting a successful project like this.
520 \begin_layout Section
524 \begin_layout Subsection
525 tdb_open_ex Is Not Expandable
528 \begin_layout Standard
529 The tdb_open() call was expanded to tdb_open_ex(), which added an optional
530 hashing function and an optional logging function argument.
531 Additional arguments to open would require the introduction of a tdb_open_ex2
535 \begin_layout Subsubsection
539 \begin_layout Standard
540 tdb_open() will take a linked-list of attributes:
543 \begin_layout LyX-Code
547 \begin_layout LyX-Code
548 TDB_ATTRIBUTE_LOG = 0,
551 \begin_layout LyX-Code
552 TDB_ATTRIBUTE_HASH = 1
555 \begin_layout LyX-Code
559 \begin_layout LyX-Code
560 struct tdb_attribute_base {
563 \begin_layout LyX-Code
564 enum tdb_attribute attr;
567 \begin_layout LyX-Code
568 union tdb_attribute *next;
571 \begin_layout LyX-Code
575 \begin_layout LyX-Code
576 struct tdb_attribute_log {
579 \begin_layout LyX-Code
580 struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_LOG */
583 \begin_layout LyX-Code
587 \begin_layout LyX-Code
591 \begin_layout LyX-Code
595 \begin_layout LyX-Code
596 struct tdb_attribute_hash {
599 \begin_layout LyX-Code
600 struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_HASH */
603 \begin_layout LyX-Code
604 tdb_hash_func hash_fn;
607 \begin_layout LyX-Code
611 \begin_layout LyX-Code
615 \begin_layout LyX-Code
616 union tdb_attribute {
619 \begin_layout LyX-Code
620 struct tdb_attribute_base base;
623 \begin_layout LyX-Code
624 struct tdb_attribute_log log;
627 \begin_layout LyX-Code
628 struct tdb_attribute_hash hash;
631 \begin_layout LyX-Code
635 \begin_layout Standard
636 This allows future attributes to be added, even if this expands the size
640 \begin_layout Subsection
641 tdb_traverse Makes Impossible Guarantees
644 \begin_layout Standard
645 tdb_traverse (and tdb_firstkey/tdb_nextkey) predate transactions, and it
646 was thought that it was important to guarantee that all records which exist
647 at the start and end of the traversal would be included, and no record
648 would be included twice.
651 \begin_layout Standard
652 This adds complexity (see
653 \begin_inset CommandInset ref
655 reference "Reliable-Traversal-Adds"
659 ) and does not work anyway for records which are altered (in particular,
660 those which are expanded may be effectively deleted and re-added behind
664 \begin_layout Subsubsection
665 \begin_inset CommandInset label
667 name "traverse-Proposed-Solution"
674 \begin_layout Standard
675 Abandon the guarantee.
676 You will see every record if no changes occur during your traversal, otherwise
677 you will see some subset.
678 You can prevent changes by using a transaction or the locking API.
681 \begin_layout Subsection
682 Nesting of Transactions Is Fraught
685 \begin_layout Standard
686 TDB has alternated between allowing nested transactions and not allowing
688 Various paths in the Samba codebase assume that transactions will nest,
689 and in a sense they can: the operation is only committed to disk when the
690 outer transaction is committed.
691 There are two problems, however:
694 \begin_layout Enumerate
695 Canceling the inner transaction will cause the outer transaction commit
696 to fail, and will not undo any operations since the inner transaction began.
697 This problem is soluble with some additional internal code.
700 \begin_layout Enumerate
701 An inner transaction commit can be cancelled by the outer transaction.
702 This is desirable in the way which Samba's database initialization code
703 uses transactions, but could be a surprise to any users expecting a successful
704 transaction commit to expose changes to others.
707 \begin_layout Standard
708 The current solution is to specify the behavior at tdb_open(), with the
709 default currently that nested transactions are allowed.
710 This flag can also be changed at runtime.
713 \begin_layout Subsubsection
717 \begin_layout Standard
718 Given the usage patterns, it seems that the
719 \begin_inset Quotes eld
723 \begin_inset Quotes erd
726 behavior of disallowing nested transactions should become the default.
727 Additionally, it seems the outer transaction is the only code which knows
728 whether inner transactions should be allowed, so a flag to indicate this
729 could be added to tdb_transaction_start.
730 However, this behavior can be simulated with a wrapper which uses tdb_add_flags
731 () and tdb_remove_flags(), so the API should not be expanded for this relatively
735 \begin_layout Subsection
736 Incorrect Hash Function is Not Detected
739 \begin_layout Standard
740 tdb_open_ex() allows the calling code to specify a different hash function
741 to use, but does not check that all other processes accessing this tdb
742 are using the same hash function.
743 The result is that records are missing from tdb_fetch().
746 \begin_layout Subsubsection
750 \begin_layout Standard
751 The header should contain an example hash result (eg.
752 the hash of 0xdeadbeef), and tdb_open_ex() should check that the given
753 hash function produces the same answer, or fail the tdb_open call.
756 \begin_layout Subsection
757 tdb_set_max_dead/TDB_VOLATILE Expose Implementation
760 \begin_layout Standard
761 In response to scalability issues with the free list (
762 \begin_inset CommandInset ref
764 reference "TDB-Freelist-Is"
768 ) two API workarounds have been incorporated in TDB: tdb_set_max_dead()
769 and the TDB_VOLATILE flag to tdb_open.
770 The latter actually calls the former with an argument of
771 \begin_inset Quotes eld
775 \begin_inset Quotes erd
781 \begin_layout Standard
782 This code allows deleted records to accumulate without putting them in the
784 On delete we iterate through each chain and free them in a batch if there
785 are more than max_dead entries.
786 These are never otherwise recycled except as a side-effect of a tdb_repack.
789 \begin_layout Subsubsection
793 \begin_layout Standard
794 With the scalability problems of the freelist solved, this API can be removed.
795 The TDB_VOLATILE flag may still be useful as a hint that store and delete
796 of records will be at least as common as fetch in order to allow some internal
797 tuning, but initially will become a no-op.
800 \begin_layout Subsection
801 \begin_inset CommandInset label
803 name "TDB-Files-Cannot"
807 TDB Files Cannot Be Opened Multiple Times In The Same Process
810 \begin_layout Standard
811 No process can open the same TDB twice; we check and disallow it.
812 This is an unfortunate side-effect of fcntl locks, which operate on a per-file
813 rather than per-file-descriptor basis, and do not nest.
814 Thus, closing any file descriptor on a file clears all the locks obtained
815 by this process, even if they were placed using a different file descriptor!
818 \begin_layout Standard
819 Note that even if this were solved, deadlock could occur if operations were
820 nested: this is a more manageable programming error in most cases.
823 \begin_layout Subsubsection
827 \begin_layout Standard
828 We could lobby POSIX to fix the perverse rules, or at least lobby Linux
829 to violate them so that the most common implementation does not have this
831 This would be a generally good idea for other fcntl lock users.
834 \begin_layout Standard
835 Samba uses a wrapper which hands out the same tdb_context to multiple callers
836 if this happens, and does simple reference counting.
837 We should do this inside the tdb library, which already emulates lock nesting
838 internally; it would need to recognize when deadlock occurs within a single
840 This would create a new failure mode for tdb operations (while we currently
841 handle locking failures, they are impossible in normal use and a process
842 encountering them can do little but give up).
845 \begin_layout Standard
846 I do not see benefit in an additional tdb_open flag to indicate whether
847 re-opening is allowed, as though there may be some benefit to adding a
848 call to detect when a tdb_context is shared, to allow other to create such
852 \begin_layout Subsection
853 TDB API Is Not POSIX Thread-safe
856 \begin_layout Standard
857 The TDB API uses an error code which can be queried after an operation to
858 determine what went wrong.
859 This programming model does not work with threads, unless specific additional
860 guarantees are given by the implementation.
861 In addition, even otherwise-independent threads cannot open the same TDB
863 \begin_inset CommandInset ref
865 reference "TDB-Files-Cannot"
872 \begin_layout Subsubsection
876 \begin_layout Standard
877 Reachitecting the API to include a tdb_errcode pointer would be a great
878 deal of churn; we are better to guarantee that the tdb_errcode is per-thread
879 so the current programming model can be maintained.
882 \begin_layout Standard
883 This requires dynamic per-thread allocations, which is awkward with POSIX
884 threads (pthread_key_create space is limited and we cannot simply allocate
885 a key for every TDB).
888 \begin_layout Standard
889 Internal locking is required to make sure that fcntl locks do not overlap
890 between threads, and also that the global list of tdbs is maintained.
893 \begin_layout Standard
894 The aim is that building tdb with -DTDB_PTHREAD will result in a pthread-safe
895 version of the library, and otherwise no overhead will exist.
898 \begin_layout Subsection
899 *_nonblock Functions And *_mark Functions Expose Implementation
902 \begin_layout Standard
907 \begin_layout Plain Layout
908 Clustered TDB, see http://ctdb.samba.org
913 wishes to operate on TDB in a non-blocking manner.
914 This is currently done as follows:
917 \begin_layout Enumerate
918 Call the _nonblock variant of an API function (eg.
919 tdb_lockall_nonblock).
923 \begin_layout Enumerate
924 Fork a child process, and wait for it to call the normal variant (eg.
928 \begin_layout Enumerate
929 If the child succeeds, call the _mark variant to indicate we already have
934 \begin_layout Enumerate
935 Upon completion, tell the child to release the locks (eg.
939 \begin_layout Enumerate
940 Indicate to tdb that it should consider the locks removed (eg.
944 \begin_layout Standard
945 There are several issues with this approach.
946 Firstly, adding two new variants of each function clutters the API for
947 an obscure use, and so not all functions have three variants.
948 Secondly, it assumes that all paths of the functions ask for the same locks,
949 otherwise the parent process will have to get a lock which the child doesn't
950 have under some circumstances.
951 I don't believe this is currently the case, but it constrains the implementatio
956 \begin_layout Subsubsection
957 \begin_inset CommandInset label
959 name "Proposed-Solution-locking-hook"
966 \begin_layout Standard
967 Implement a hook for locking methods, so that the caller can control the
968 calls to create and remove fcntl locks.
969 In this scenario, ctdbd would operate as follows:
972 \begin_layout Enumerate
973 Call the normal API function, eg tdb_lockall().
976 \begin_layout Enumerate
977 When the lock callback comes in, check if the child has the lock.
978 Initially, this is always false.
980 Otherwise, try to obtain it in non-blocking mode.
981 If that fails, return EWOULDBLOCK.
984 \begin_layout Enumerate
985 Release locks in the unlock callback as normal.
988 \begin_layout Enumerate
989 If tdb_lockall() fails, see if we recorded a lock failure; if so, call the
990 child to repeat the operation.
993 \begin_layout Enumerate
994 The child records what locks it obtains, and returns that information to
998 \begin_layout Enumerate
999 When the child has succeeded, goto 1.
1002 \begin_layout Standard
1003 This is flexible enough to handle any potential locking scenario, even when
1004 lock requirements change.
1005 It can be optimized so that the parent does not release locks, just tells
1006 the child which locks it doesn't need to obtain.
1009 \begin_layout Standard
1010 It also keeps the complexity out of the API, and in ctdbd where it is needed.
1013 \begin_layout Subsection
1014 tdb_chainlock Functions Expose Implementation
1017 \begin_layout Standard
1018 tdb_chainlock locks some number of records, including the record indicated
1020 This gave atomicity guarantees; no-one can start a transaction, alter,
1021 read or delete that key while the lock is held.
1024 \begin_layout Standard
1025 It also makes the same guarantee for any other key in the chain, which is
1026 an internal implementation detail and potentially a cause for deadlock.
1029 \begin_layout Subsubsection
1033 \begin_layout Standard
1035 It would be nice to have an explicit single entry lock which effected no
1037 Unfortunately, this won't work for an entry which doesn't exist.
1038 Thus while chainlock may be implemented more efficiently for the existing
1039 case, it will still have overlap issues with the non-existing case.
1040 So it is best to keep the current (lack of) guarantee about which records
1041 will be effected to avoid constraining our implementation.
1044 \begin_layout Subsection
1045 Signal Handling is Not Race-Free
1048 \begin_layout Standard
1049 The tdb_setalarm_sigptr() call allows the caller's signal handler to indicate
1050 that the tdb locking code should return with a failure, rather than trying
1051 again when a signal is received (and errno == EAGAIN).
1052 This is usually used to implement timeouts.
1055 \begin_layout Standard
1056 Unfortunately, this does not work in the case where the signal is received
1057 before the tdb code enters the fcntl() call to place the lock: the code
1058 will sleep within the fcntl() code, unaware that the signal wants it to
1060 In the case of long timeouts, this does not happen in practice.
1063 \begin_layout Subsubsection
1067 \begin_layout Standard
1068 The locking hooks proposed in
1069 \begin_inset CommandInset ref
1071 reference "Proposed-Solution-locking-hook"
1075 would allow the user to decide on whether to fail the lock acquisition
1077 This allows the caller to choose their own compromise: they could narrow
1078 the race by checking immediately before the fcntl call.
1082 \begin_layout Plain Layout
1083 It may be possible to make this race-free in some implementations by having
1084 the signal handler alter the struct flock to make it invalid.
1085 This will cause the fcntl() lock call to fail with EINVAL if the signal
1086 occurs before the kernel is entered, otherwise EAGAIN.
1094 \begin_layout Subsection
1095 The API Uses Gratuitous Typedefs, Capitals
1098 \begin_layout Standard
1099 typedefs are useful for providing source compatibility when types can differ
1100 across implementations, or arguably in the case of function pointer definitions
1101 which are hard for humans to parse.
1102 Otherwise it is simply obfuscation and pollutes the namespace.
1105 \begin_layout Standard
1106 Capitalization is usually reserved for compile-time constants and macros.
1109 \begin_layout Description
1110 TDB_CONTEXT There is no reason to use this over 'struct tdb_context'; the
1111 definition isn't visible to the API user anyway.
1114 \begin_layout Description
1115 TDB_DATA There is no reason to use this over struct TDB_DATA; the struct
1116 needs to be understood by the API user.
1119 \begin_layout Description
1121 \begin_inset space ~
1124 TDB_DATA This would normally be called 'struct tdb_data'.
1127 \begin_layout Description
1129 \begin_inset space ~
1132 TDB_ERROR Similarly, this would normally be enum tdb_error.
1135 \begin_layout Subsubsection
1139 \begin_layout Standard
1141 Introducing lower case variants would please pedants like myself, but if
1142 it were done the existing ones should be kept.
1143 There is little point forcing a purely cosmetic change upon tdb users.
1146 \begin_layout Subsection
1147 \begin_inset CommandInset label
1149 name "tdb_log_func-Doesnt-Take"
1153 tdb_log_func Doesn't Take The Private Pointer
1156 \begin_layout Standard
1157 For API compatibility reasons, the logging function needs to call tdb_get_loggin
1158 g_private() to retrieve the pointer registered by the tdb_open_ex for logging.
1161 \begin_layout Subsubsection
1165 \begin_layout Standard
1166 It should simply take an extra argument, since we are prepared to break
1170 \begin_layout Subsection
1171 Various Callback Functions Are Not Typesafe
1174 \begin_layout Standard
1175 The callback functions in tdb_set_logging_function (after
1176 \begin_inset CommandInset ref
1178 reference "tdb_log_func-Doesnt-Take"
1182 is resolved), tdb_parse_record, tdb_traverse, tdb_traverse_read and tdb_check
1183 all take void * and must internally convert it to the argument type they
1187 \begin_layout Standard
1188 If this type changes, the compiler will not produce warnings on the callers,
1189 since it only sees void *.
1192 \begin_layout Subsubsection
1196 \begin_layout Standard
1197 With careful use of macros, we can create callback functions which give
1198 a warning when used on gcc and the types of the callback and its private
1200 Unsupported compilers will not give a warning, which is no worse than now.
1201 In addition, the callbacks become clearer, as they need not use void *
1202 for their parameter.
1205 \begin_layout Standard
1206 See CCAN's typesafe_cb module at http://ccan.ozlabs.org/info/typesafe_cb.html
1209 \begin_layout Subsection
1210 TDB_CLEAR_IF_FIRST Must Be Specified On All Opens, tdb_reopen_all Problematic
1213 \begin_layout Standard
1214 The TDB_CLEAR_IF_FIRST flag to tdb_open indicates that the TDB file should
1215 be cleared if the caller discovers it is the only process with the TDB
1217 However, if any caller does not specify TDB_CLEAR_IF_FIRST it will not
1218 be detected, so will have the TDB erased underneath them (usually resulting
1222 \begin_layout Standard
1223 There is a similar issue on fork(); if the parent exits (or otherwise closes
1224 the tdb) before the child calls tdb_reopen_all() to establish the lock
1225 used to indicate the TDB is opened by someone, a TDB_CLEAR_IF_FIRST opener
1226 at that moment will believe it alone has opened the TDB and will erase
1230 \begin_layout Subsubsection
1234 \begin_layout Standard
1235 Remove TDB_CLEAR_IF_FIRST.
1236 Other workarounds are possible, but see
1237 \begin_inset CommandInset ref
1239 reference "TDB_CLEAR_IF_FIRST-Imposes-Performance"
1246 \begin_layout Section
1247 Performance And Scalability Issues
1250 \begin_layout Subsection
1251 \begin_inset CommandInset label
1253 name "TDB_CLEAR_IF_FIRST-Imposes-Performance"
1257 TDB_CLEAR_IF_FIRST Imposes Performance Penalty
1260 \begin_layout Standard
1261 When TDB_CLEAR_IF_FIRST is specified, a 1-byte read lock is placed at offset
1264 While these locks never conflict in normal tdb usage, they do add substantial
1265 overhead for most fcntl lock implementations when the kernel scans to detect
1266 if a lock conflict exists.
1267 This is often a single linked list, making the time to acquire and release
1268 a fcntl lock O(N) where N is the number of processes with the TDB open,
1269 not the number actually doing work.
1272 \begin_layout Standard
1273 In a Samba server it is common to have huge numbers of clients sitting idle,
1274 and thus they have weaned themselves off the TDB_CLEAR_IF_FIRST flag.
1278 \begin_layout Plain Layout
1279 There is a flag to tdb_reopen_all() which is used for this optimization:
1280 if the parent process will outlive the child, the child does not need the
1282 This is a workaround for this very performance issue.
1290 \begin_layout Subsubsection
1294 \begin_layout Standard
1296 It was a neat idea, but even trivial servers tend to know when they are
1297 initializing for the first time and can simply unlink the old tdb at that
1301 \begin_layout Subsection
1302 TDB Files Have a 4G Limit
1305 \begin_layout Standard
1306 This seems to be becoming an issue (so much for
1307 \begin_inset Quotes eld
1311 \begin_inset Quotes erd
1314 !), particularly for ldb.
1317 \begin_layout Subsubsection
1321 \begin_layout Standard
1322 A new, incompatible TDB format which uses 64 bit offsets internally rather
1324 For simplicity of endian conversion (which TDB does on the fly if required),
1325 all values will be 64 bit on disk.
1326 In practice, some upper bits may be used for other purposes, but at least
1327 56 bits will be available for file offsets.
1330 \begin_layout Standard
1331 tdb_open() will automatically detect the old version, and even create them
1332 if TDB_VERSION6 is specified to tdb_open.
1335 \begin_layout Standard
1336 32 bit processes will still be able to access TDBs larger than 4G (assuming
1337 that their off_t allows them to seek to 64 bits), they will gracefully
1338 fall back as they fail to mmap.
1339 This can happen already with large TDBs.
1342 \begin_layout Standard
1343 Old versions of tdb will fail to open the new TDB files (since 28 August
1344 2009, commit 398d0c29290: prior to that any unrecognized file format would
1345 be erased and initialized as a fresh tdb!)
1348 \begin_layout Subsection
1349 TDB Records Have a 4G Limit
1352 \begin_layout Standard
1353 This has not been a reported problem, and the API uses size_t which can
1354 be 64 bit on 64 bit platforms.
1355 However, other limits may have made such an issue moot.
1358 \begin_layout Subsubsection
1362 \begin_layout Standard
1363 Record sizes will be 64 bit, with an error returned on 32 bit platforms
1364 which try to access such records (the current implementation would return
1365 TDB_ERR_OOM in a similar case).
1366 It seems unlikely that 32 bit keys will be a limitation, so the implementation
1367 may not support this (see
1368 \begin_inset CommandInset ref
1370 reference "sub:Records-Incur-A"
1377 \begin_layout Subsection
1378 Hash Size Is Determined At TDB Creation Time
1381 \begin_layout Standard
1382 TDB contains a number of hash chains in the header; the number is specified
1383 at creation time, and defaults to 131.
1384 This is such a bottleneck on large databases (as each hash chain gets quite
1385 long), that LDB uses 10,000 for this hash.
1386 In general it is impossible to know what the 'right' answer is at database
1390 \begin_layout Subsubsection
1392 \change_inserted 0 1283336713
1393 \begin_inset CommandInset label
1395 name "sub:Hash-Size-Solution"
1404 \begin_layout Standard
1405 After comprehensive performance testing on various scalable hash variants
1409 \begin_layout Plain Layout
1410 http://rusty.ozlabs.org/?p=89 and http://rusty.ozlabs.org/?p=94 This was annoying
1411 because I was previously convinced that an expanding tree of hashes would
1412 be very close to optimal.
1417 , it became clear that it is hard to beat a straight linear hash table which
1418 doubles in size when it reaches saturation.
1420 \change_deleted 0 1283307675
1421 There are three details which become important:
1424 \begin_layout Enumerate
1426 \change_deleted 0 1283307675
1427 On encountering a full bucket, we use the next bucket.
1430 \begin_layout Enumerate
1432 \change_deleted 0 1283307675
1433 Extra hash bits are stored with the offset, to reduce comparisons.
1436 \begin_layout Enumerate
1438 \change_deleted 0 1283307675
1439 A marker entry is used on deleting an entry.
1442 \begin_layout Standard
1444 \change_deleted 0 1283307675
1445 The doubling of the table must be done under a transaction; we will not
1446 reduce it on deletion, so it will be an unusual case.
1447 It will either be placed at the head (other entries will be moved out the
1448 way so we can expand).
1449 We could have a pointer in the header to the current hashtable location,
1450 but that pointer would have to be read frequently to check for hashtable
1454 \begin_layout Standard
1456 \change_deleted 0 1283307675
1457 The locking for this is slightly more complex than the chained case; we
1458 currently have one lock per bucket, and that means we would need to expand
1459 the lock if we overflow to the next bucket.
1460 The frequency of such collisions will effect our locking heuristics: we
1461 can always lock more buckets than we need.
1464 \begin_layout Standard
1466 \change_deleted 0 1283307675
1467 One possible optimization is to only re-check the hash size on an insert
1470 \change_inserted 0 1283307770
1471 Unfortunately, altering the hash table introduces serious locking complications
1472 : the entire hash table needs to be locked to enlarge the hash table, and
1473 others might be holding locks.
1474 Particularly insidious are insertions done under tdb_chainlock.
1477 \begin_layout Standard
1479 \change_inserted 0 1283336187
1480 Thus an expanding layered hash will be used: an array of hash groups, with
1481 each hash group exploding into pointers to lower hash groups once it fills,
1482 turning into a hash tree.
1483 This has implications for locking: we must lock the entire group in case
1484 we need to expand it, yet we don't know how deep the tree is at that point.
1487 \begin_layout Standard
1489 \change_inserted 0 1283336586
1490 Note that bits from the hash table entries should be stolen to hold more
1491 hash bits to reduce the penalty of collisions.
1492 We can use the otherwise-unused lower 3 bits.
1493 If we limit the size of the database to 64 exabytes, we can use the top
1494 8 bits of the hash entry as well.
1495 These 11 bits would reduce false positives down to 1 in 2000 which is more
1496 than we need: we can use one of the bits to indicate that the extra hash
1498 This means we can choose not to re-hash all entries when we expand a hash
1499 group; simply use the next bits we need and mark them invalid.
1504 \begin_layout Subsection
1505 \begin_inset CommandInset label
1507 name "TDB-Freelist-Is"
1511 TDB Freelist Is Highly Contended
1514 \begin_layout Standard
1515 TDB uses a single linked list for the free list.
1516 Allocation occurs as follows, using heuristics which have evolved over
1520 \begin_layout Enumerate
1521 Get the free list lock for this whole operation.
1524 \begin_layout Enumerate
1525 Multiply length by 1.25, so we always over-allocate by 25%.
1528 \begin_layout Enumerate
1529 Set the slack multiplier to 1.
1532 \begin_layout Enumerate
1533 Examine the current freelist entry: if it is > length but < the current
1534 best case, remember it as the best case.
1537 \begin_layout Enumerate
1538 Multiply the slack multiplier by 1.05.
1541 \begin_layout Enumerate
1542 If our best fit so far is less than length * slack multiplier, return it.
1543 The slack will be turned into a new free record if it's large enough.
1546 \begin_layout Enumerate
1547 Otherwise, go onto the next freelist entry.
1550 \begin_layout Standard
1551 Deleting a record occurs as follows:
1554 \begin_layout Enumerate
1555 Lock the hash chain for this whole operation.
1558 \begin_layout Enumerate
1559 Walk the chain to find the record, keeping the prev pointer offset.
1562 \begin_layout Enumerate
1563 If max_dead is non-zero:
1567 \begin_layout Enumerate
1568 Walk the hash chain again and count the dead records.
1571 \begin_layout Enumerate
1572 If it's more than max_dead, bulk free all the dead ones (similar to steps
1573 4 and below, but the lock is only obtained once).
1576 \begin_layout Enumerate
1577 Simply mark this record as dead and return.
1582 \begin_layout Enumerate
1583 Get the free list lock for the remainder of this operation.
1586 \begin_layout Enumerate
1587 \begin_inset CommandInset label
1589 name "right-merging"
1593 Examine the following block to see if it is free; if so, enlarge the current
1594 block and remove that block from the free list.
1595 This was disabled, as removal from the free list was O(entries-in-free-list).
1598 \begin_layout Enumerate
1599 Examine the preceeding block to see if it is free: for this reason, each
1600 block has a 32-bit tailer which indicates its length.
1601 If it is free, expand it to cover our new block and return.
1604 \begin_layout Enumerate
1605 Otherwise, prepend ourselves to the free list.
1608 \begin_layout Standard
1609 Disabling right-merging (step
1610 \begin_inset CommandInset ref
1612 reference "right-merging"
1616 ) causes fragmentation; the other heuristics proved insufficient to address
1617 this, so the final answer to this was that when we expand the TDB file
1618 inside a transaction commit, we repack the entire tdb.
1621 \begin_layout Standard
1622 The single list lock limits our allocation rate; due to the other issues
1623 this is not currently seen as a bottleneck.
1626 \begin_layout Subsubsection
1628 \change_deleted 0 1283336858
1632 \begin_layout Standard
1633 The first step is to remove all the current heuristics, as they obviously
1634 interact, then examine them once the lock contention is addressed.
1637 \begin_layout Standard
1638 The free list must be split to reduce contention.
1639 Assuming perfect free merging, we can at most have 1 free list entry for
1641 This implies that the number of free lists is related to the size of the
1642 hash table, but as it is rare to walk a large number of free list entries
1643 we can use far fewer, say 1/32 of the number of hash buckets.
1644 \change_inserted 0 1283336910
1648 \begin_layout Standard
1650 \change_inserted 0 1283337052
1651 It seems tempting to try to reuse the hash implementation which we use for
1652 records here, but we have two ways of searching for free entries: for allocatio
1653 n we search by size (and possibly zone) which produces too many clashes
1654 for our hash table to handle well, and for coalescing we search by address.
1655 Thus an array of doubly-linked free lists seems preferable.
1660 \begin_layout Standard
1661 There are various benefits in using per-size free lists (see
1662 \begin_inset CommandInset ref
1664 reference "sub:TDB-Becomes-Fragmented"
1668 ) but it's not clear this would reduce contention in the common case where
1669 all processes are allocating/freeing the same size.
1670 Thus we almost certainly need to divide in other ways: the most obvious
1671 is to divide the file into zones, and using a free list (or set of free
1673 This approximates address ordering.
1676 \begin_layout Standard
1677 Note that this means we need to split the free lists when we expand the
1678 file; this is probably acceptable when we double the hash table size, since
1679 that is such an expensive operation already.
1680 In the case of increasing the file size, there is an optimization we can
1681 use: if we use M in the formula above as the file size rounded up to the
1682 next power of 2, we only need reshuffle free lists when the file size crosses
1683 a power of 2 boundary,
1687 reshuffling the free lists is trivial: we simply merge every consecutive
1691 \begin_layout Standard
1692 The basic algorithm is as follows.
1696 \begin_layout Enumerate
1697 Identify the correct zone.
1700 \begin_layout Enumerate
1701 Lock the corresponding list.
1704 \begin_layout Enumerate
1705 Re-check the zone (we didn't have a lock, sizes could have changed): relock
1709 \begin_layout Enumerate
1710 Place the freed entry in the list for that zone.
1713 \begin_layout Standard
1714 Allocation is a little more complicated, as we perform delayed coalescing
1718 \begin_layout Enumerate
1719 Pick a zone either the zone we last freed into, or based on a
1720 \begin_inset Quotes eld
1724 \begin_inset Quotes erd
1730 \begin_layout Enumerate
1731 Lock the corresponding list.
1734 \begin_layout Enumerate
1735 Re-check the zone: relock if necessary.
1738 \begin_layout Enumerate
1739 If the top entry is -large enough, remove it from the list and return it.
1742 \begin_layout Enumerate
1743 Otherwise, coalesce entries in the list.If there was no entry large enough,
1744 unlock the list and try the next zone.
1747 \begin_layout Enumerate
1748 If no zone satisfies, expand the file.
1751 \begin_layout Standard
1752 This optimizes rapid insert/delete of free list entries by not coalescing
1754 First-fit address ordering ordering seems to be fairly good for keeping
1755 fragmentation low (see
1756 \begin_inset CommandInset ref
1758 reference "sub:TDB-Becomes-Fragmented"
1763 Note that address ordering does not need a tailer to coalesce, though if
1764 we needed one we could have one cheaply: see
1765 \begin_inset CommandInset ref
1767 reference "sub:Records-Incur-A"
1775 \begin_layout Standard
1776 I anticipate that the number of entries in each free zone would be small,
1777 but it might be worth using one free entry to hold pointers to the others
1778 for cache efficiency.
1779 \change_inserted 0 1283309850
1783 \begin_layout Standard
1785 \change_inserted 0 1283337216
1786 \begin_inset CommandInset label
1788 name "freelist-in-zone"
1792 If we want to avoid locking complexity (enlarging the free lists when we
1793 enlarge the file) we could place the array of free lists at the beginning
1795 This means existing array lists never move, but means that a record cannot
1796 be larger than a zone.
1797 That in turn implies that zones should be variable sized (say, power of
1798 2), which makes the question
1799 \begin_inset Quotes eld
1802 what zone is this record in?
1803 \begin_inset Quotes erd
1807 \begin_inset Quotes eld
1811 \begin_inset Quotes erd
1814 , but that's less common).
1815 It could be done with as few as 4 bits from the record header.
1819 \begin_layout Plain Layout
1821 \change_inserted 0 1283310945
1823 \begin_inset Formula $2^{16+N*3}$
1826 means 0 gives a minimal 65536-byte zone, 15 gives the maximal
1827 \begin_inset Formula $2^{61}$
1831 Zones range in factor of 8 steps.
1843 \begin_layout Subsection
1844 \begin_inset CommandInset label
1846 name "sub:TDB-Becomes-Fragmented"
1850 TDB Becomes Fragmented
1853 \begin_layout Standard
1854 Much of this is a result of allocation strategy
1858 \begin_layout Plain Layout
1859 The Memory Fragmentation Problem: Solved? Johnstone & Wilson 1995 ftp://ftp.cs.ute
1860 xas.edu/pub/garbage/malloc/ismm98.ps
1865 and deliberate hobbling of coalescing; internal fragmentation (aka overallocati
1866 on) is deliberately set at 25%, and external fragmentation is only cured
1867 by the decision to repack the entire db when a transaction commit needs
1868 to enlarge the file.
1871 \begin_layout Subsubsection
1875 \begin_layout Standard
1876 The 25% overhead on allocation works in practice for ldb because indexes
1877 tend to expand by one record at a time.
1878 This internal fragmentation can be resolved by having an
1879 \begin_inset Quotes eld
1883 \begin_inset Quotes erd
1886 bit in the header to note entries that have previously expanded, and allocating
1887 more space for them.
1890 \begin_layout Standard
1891 There are is a spectrum of possible solutions for external fragmentation:
1892 one is to use a fragmentation-avoiding allocation strategy such as best-fit
1893 address-order allocator.
1894 The other end of the spectrum would be to use a bump allocator (very fast
1895 and simple) and simply repack the file when we reach the end.
1898 \begin_layout Standard
1899 There are three problems with efficient fragmentation-avoiding allocators:
1900 they are non-trivial, they tend to use a single free list for each size,
1901 and there's no evidence that tdb allocation patterns will match those recorded
1902 for general allocators (though it seems likely).
1905 \begin_layout Standard
1906 Thus we don't spend too much effort on external fragmentation; we will be
1907 no worse than the current code if we need to repack on occasion.
1908 More effort is spent on reducing freelist contention, and reducing overhead.
1911 \begin_layout Subsection
1912 \begin_inset CommandInset label
1914 name "sub:Records-Incur-A"
1918 Records Incur A 28-Byte Overhead
1921 \begin_layout Standard
1922 Each TDB record has a header as follows:
1925 \begin_layout LyX-Code
1929 \begin_layout LyX-Code
1930 tdb_off_t next; /* offset of the next record in the list */
1933 \begin_layout LyX-Code
1934 tdb_len_t rec_len; /* total byte length of record */
1937 \begin_layout LyX-Code
1938 tdb_len_t key_len; /* byte length of key */
1941 \begin_layout LyX-Code
1942 tdb_len_t data_len; /* byte length of data */
1945 \begin_layout LyX-Code
1946 uint32_t full_hash; /* the full 32 bit hash of the key */
1949 \begin_layout LyX-Code
1950 uint32_t magic; /* try to catch errors */
1953 \begin_layout LyX-Code
1954 /* the following union is implied:
1957 \begin_layout LyX-Code
1961 \begin_layout LyX-Code
1962 char record[rec_len];
1965 \begin_layout LyX-Code
1969 \begin_layout LyX-Code
1973 \begin_layout LyX-Code
1974 char data[data_len];
1977 \begin_layout LyX-Code
1981 \begin_layout LyX-Code
1982 uint32_t totalsize; (tailer)
1985 \begin_layout LyX-Code
1989 \begin_layout LyX-Code
1993 \begin_layout LyX-Code
1997 \begin_layout Standard
1998 Naively, this would double to a 56-byte overhead on a 64 bit implementation.
2001 \begin_layout Subsubsection
2005 \begin_layout Standard
2006 We can use various techniques to reduce this for an allocated block:
2009 \begin_layout Enumerate
2010 The 'next' pointer is not required, as we are using a flat hash table.
2013 \begin_layout Enumerate
2014 'rec_len' can instead be expressed as an addition to key_len and data_len
2015 (it accounts for wasted or overallocated length in the record).
2016 Since the record length is always a multiple of 8, we can conveniently
2017 fit it in 32 bits (representing up to 35 bits).
2020 \begin_layout Enumerate
2021 'key_len' and 'data_len' can be reduced.
2022 I'm unwilling to restrict 'data_len' to 32 bits, but instead we can combine
2023 the two into one 64-bit field and using a 5 bit value which indicates at
2024 what bit to divide the two.
2025 Keys are unlikely to scale as fast as data, so I'm assuming a maximum key
2029 \begin_layout Enumerate
2030 'full_hash' is used to avoid a memcmp on the
2031 \begin_inset Quotes eld
2035 \begin_inset Quotes erd
2038 case, but this is diminishing returns after a handful of bits (at 10 bits,
2039 it reduces 99.9% of false memcmp).
2040 As an aside, as the lower bits are already incorporated in the hash table
2041 resolution, the upper bits should be used here.
2043 \change_inserted 0 1283336739
2044 Note that it's not clear that these bits will be a win, given the extra
2045 bits in the hash table itself (see
2046 \begin_inset CommandInset ref
2048 reference "sub:Hash-Size-Solution"
2057 \begin_layout Enumerate
2058 'magic' does not need to be enlarged: it currently reflects one of 5 values
2059 (used, free, dead, recovery, and unused_recovery).
2060 It is useful for quick sanity checking however, and should not be eliminated.
2063 \begin_layout Enumerate
2064 'tailer' is only used to coalesce free blocks (so a block to the right can
2065 find the header to check if this block is free).
2066 This can be replaced by a single 'free' bit in the header of the following
2067 block (and the tailer only exists in free blocks).
2071 \begin_layout Plain Layout
2072 This technique from Thomas Standish.
2073 Data Structure Techniques.
2074 Addison-Wesley, Reading, Massachusetts, 1980.
2079 The current proposed coalescing algorithm doesn't need this, however.
2082 \begin_layout Standard
2083 This produces a 16 byte used header like this:
2086 \begin_layout LyX-Code
2087 struct tdb_used_record {
2090 \begin_layout LyX-Code
2091 uint32_t magic : 16,
2094 \begin_layout LyX-Code
2098 \begin_layout LyX-Code
2102 \begin_layout LyX-Code
2106 \begin_layout LyX-Code
2107 uint32_t extra_octets;
2110 \begin_layout LyX-Code
2111 uint64_t key_and_data_len;
2114 \begin_layout LyX-Code
2118 \begin_layout Standard
2119 And a free record like this:
2122 \begin_layout LyX-Code
2123 struct tdb_free_record {
2126 \begin_layout LyX-Code
2127 uint32_t free_magic;
2130 \begin_layout LyX-Code
2131 uint64_t total_length;
2132 \change_inserted 0 1283337133
2136 \begin_layout LyX-Code
2138 \change_inserted 0 1283337139
2139 uint64_t prev, next;
2144 \begin_layout LyX-Code
2148 \begin_layout LyX-Code
2152 \begin_layout LyX-Code
2156 \begin_layout Standard
2158 \change_inserted 0 1283337235
2159 We might want to take some bits from the used record's top_hash (and the
2160 free record which has 32 bits of padding to spare anyway) if we use variable
2163 \begin_inset CommandInset ref
2165 reference "freelist-in-zone"
2174 \begin_layout Subsection
2175 Transaction Commit Requires 4 fdatasync
2178 \begin_layout Standard
2179 The current transaction algorithm is:
2182 \begin_layout Enumerate
2183 write_recovery_data();
2186 \begin_layout Enumerate
2190 \begin_layout Enumerate
2191 write_recovery_header();
2194 \begin_layout Enumerate
2198 \begin_layout Enumerate
2199 overwrite_with_new_data();
2202 \begin_layout Enumerate
2206 \begin_layout Enumerate
2207 remove_recovery_header();
2210 \begin_layout Enumerate
2214 \begin_layout Standard
2215 On current ext3, each sync flushes all data to disk, so the next 3 syncs
2216 are relatively expensive.
2217 But this could become a performance bottleneck on other filesystems such
2221 \begin_layout Subsubsection
2225 \begin_layout Standard
2226 Neil Brown points out that this is overzealous, and only one sync is needed:
2229 \begin_layout Enumerate
2230 Bundle the recovery data, a transaction counter and a strong checksum of
2234 \begin_layout Enumerate
2235 Strong checksum that whole bundle.
2238 \begin_layout Enumerate
2239 Store the bundle in the database.
2242 \begin_layout Enumerate
2243 Overwrite the oldest of the two recovery pointers in the header (identified
2244 using the transaction counter) with the offset of this bundle.
2247 \begin_layout Enumerate
2251 \begin_layout Enumerate
2252 Write the new data to the file.
2255 \begin_layout Standard
2256 Checking for recovery means identifying the latest bundle with a valid checksum
2257 and using the new data checksum to ensure that it has been applied.
2258 This is more expensive than the current check, but need only be done at
2260 For running databases, a separate header field can be used to indicate
2261 a transaction in progress; we need only check for recovery if this is set.
2264 \begin_layout Subsection
2265 \begin_inset CommandInset label
2267 name "sub:TDB-Does-Not"
2271 TDB Does Not Have Snapshot Support
2274 \begin_layout Subsubsection
2278 \begin_layout Standard
2280 At some point you say
2281 \begin_inset Quotes eld
2285 \begin_inset Quotes erd
2291 \begin_layout Standard
2292 But as a thought experiment, if we implemented transactions to only overwrite
2293 free entries (this is tricky: there must not be a header in each entry
2294 which indicates whether it is free, but use of presence in metadata elsewhere),
2295 and a pointer to the hash table, we could create an entirely new commit
2296 without destroying existing data.
2297 Then it would be easy to implement snapshots in a similar way.
2300 \begin_layout Standard
2301 This would not allow arbitrary changes to the database, such as tdb_repack
2302 does, and would require more space (since we have to preserve the current
2303 and future entries at once).
2304 If we used hash trees rather than one big hash table, we might only have
2305 to rewrite some sections of the hash, too.
2308 \begin_layout Standard
2309 We could then implement snapshots using a similar method, using multiple
2310 different hash tables/free tables.
2313 \begin_layout Subsection
2314 Transactions Cannot Operate in Parallel
2317 \begin_layout Standard
2318 This would be useless for ldb, as it hits the index records with just about
2320 It would add significant complexity in resolving clashes, and cause the
2321 all transaction callers to write their code to loop in the case where the
2322 transactions spuriously failed.
2325 \begin_layout Subsubsection
2329 \begin_layout Standard
2330 We could solve a small part of the problem by providing read-only transactions.
2331 These would allow one write transaction to begin, but it could not commit
2332 until all r/o transactions are done.
2333 This would require a new RO_TRANSACTION_LOCK, which would be upgraded on
2337 \begin_layout Subsection
2338 Default Hash Function Is Suboptimal
2341 \begin_layout Standard
2342 The Knuth-inspired multiplicative hash used by tdb is fairly slow (especially
2343 if we expand it to 64 bits), and works best when the hash bucket size is
2344 a prime number (which also means a slow modulus).
2345 In addition, it is highly predictable which could potentially lead to a
2346 Denial of Service attack in some TDB uses.
2349 \begin_layout Subsubsection
2353 \begin_layout Standard
2354 The Jenkins lookup3 hash
2358 \begin_layout Plain Layout
2359 http://burtleburtle.net/bob/c/lookup3.c
2364 is a fast and superbly-mixing hash.
2365 It's used by the Linux kernel and almost everything else.
2366 This has the particular properties that it takes an initial seed, and produces
2367 two 32 bit hash numbers, which we can combine into a 64-bit hash.
2370 \begin_layout Standard
2371 The seed should be created at tdb-creation time from some random source,
2372 and placed in the header.
2373 This is far from foolproof, but adds a little bit of protection against
2377 \begin_layout Subsection
2378 \begin_inset CommandInset label
2380 name "Reliable-Traversal-Adds"
2384 Reliable Traversal Adds Complexity
2387 \begin_layout Standard
2388 We lock a record during traversal iteration, and try to grab that lock in
2390 If that grab on delete fails, we simply mark it deleted and continue onwards;
2391 traversal checks for this condition and does the delete when it moves off
2395 \begin_layout Standard
2396 If traversal terminates, the dead record may be left indefinitely.
2399 \begin_layout Subsubsection
2403 \begin_layout Standard
2404 Remove reliability guarantees; see
2405 \begin_inset CommandInset ref
2407 reference "traverse-Proposed-Solution"
2414 \begin_layout Subsection
2415 Fcntl Locking Adds Overhead
2418 \begin_layout Standard
2419 Placing a fcntl lock means a system call, as does removing one.
2420 This is actually one reason why transactions can be faster (everything
2421 is locked once at transaction start).
2422 In the uncontended case, this overhead can theoretically be eliminated.
2425 \begin_layout Subsubsection
2429 \begin_layout Standard
2433 \begin_layout Standard
2434 We tried this before with spinlock support, in the early days of TDB, and
2435 it didn't make much difference except in manufactured benchmarks.
2438 \begin_layout Standard
2439 We could use spinlocks (with futex kernel support under Linux), but it means
2440 that we lose automatic cleanup when a process dies with a lock.
2441 There is a method of auto-cleanup under Linux, but it's not supported by
2442 other operating systems.
2443 We could reintroduce a clear-if-first-style lock and sweep for dead futexes
2444 on open, but that wouldn't help the normal case of one concurrent opener
2446 Increasingly elaborate repair schemes could be considered, but they require
2447 an ABI change (everyone must use them) anyway, so there's no need to do
2448 this at the same time as everything else.
2451 \begin_layout Subsection
2452 Some Transactions Don't Require Durability
2455 \begin_layout Standard
2456 Volker points out that gencache uses a CLEAR_IF_FIRST tdb for normal (fast)
2457 usage, and occasionally empties the results into a transactional TDB.
2458 This kind of usage prioritizes performance over durability: as long as
2459 we are consistent, data can be lost.
2462 \begin_layout Standard
2463 This would be more neatly implemented inside tdb: a
2464 \begin_inset Quotes eld
2468 \begin_inset Quotes erd
2471 transaction commit (ie.
2472 syncless) which meant that data may be reverted on a crash.
2475 \begin_layout Subsubsection
2479 \begin_layout Standard
2483 \begin_layout Standard
2484 Unfortunately any transaction scheme which overwrites old data requires
2485 a sync before that overwrite to avoid the possibility of corruption.
2488 \begin_layout Standard
2489 It seems possible to use a scheme similar to that described in
2490 \begin_inset CommandInset ref
2492 reference "sub:TDB-Does-Not"
2496 ,where transactions are committed without overwriting existing data, and
2497 an array of top-level pointers were available in the header.
2498 If the transaction is
2499 \begin_inset Quotes eld
2503 \begin_inset Quotes erd
2506 then we would not need a sync at all: existing processes would pick up
2507 the new hash table and free list and work with that.
2510 \begin_layout Standard
2511 At some later point, a sync would allow recovery of the old data into the
2512 free lists (perhaps when the array of top-level pointers filled).
2513 On crash, tdb_open() would examine the array of top levels, and apply the
2514 transactions until it encountered an invalid checksum.
2524 @Moving hash table does not work.
2531 \begin_layout Plain Layout
2533 \change_inserted 0 1283336450
2534 If we make the hash offsets zone-relative, then this only restricts the
2535 zone size, not the overall database size.
2557 There are three details which become important:
2572 \begin_layout LyX-Code
2578 @Soft transaction commit
2583 \author "Rusty Russell,,,"
2586 \change_deleted 0 1280141199
2588 \change_inserted 0 1280141202
2594 \change_inserted 0 1280140902
2599 \change_inserted 0 1280140661
2603 \change_inserted 0 1280140703
2606 \change_inserted 0 1280708312
2609 \change_inserted 0 1280708400
2612 \change_inserted 0 1280140836
2615 \change_inserted 0 1280708255
2618 \change_inserted 0 1280708374
2621 \change_inserted 0 1280141181
2624 \change_inserted 0 1280141345
2645 @Transaction and freelist rethink.
2650 \author "Rusty Russell,,,"
2656 behavior of disallowing
2657 \change_inserted 0 1272940179
2660 transactions should become the default.
2662 \change_inserted 0 1272944650
2666 \change_inserted 0 1272944763
2675 \change_inserted 0 1273478114
2682 \change_deleted 0 1273469807
2684 \change_inserted 0 1273469810
2688 \change_deleted 0 1273469815
2691 to reduce contention.
2693 \change_inserted 0 1273470006
2697 \change_inserted 0 1273492055
2700 \change_inserted 0 1273483888
2706 \change_deleted 0 1272942055
2707 There are various ways to organize these lisys, but because we want to be
2708 able to quickly identify which free list an entry is in, and reduce the
2709 number of locks required for merging, we will use zoning (eg.
2710 each free list covers some fixed fraction of the file).
2712 \change_inserted 0 1273484187
2716 \change_deleted 0 1273484194
2718 \change_inserted 0 1273484194
2724 Identify the correct
2725 \change_deleted 0 1273482856
2727 \change_inserted 0 1273482857
2734 \change_inserted 0 1273482895
2738 \change_inserted 0 1273482863
2742 \change_inserted 0 1273482909
2746 \change_deleted 0 1273482885
2748 \change_inserted 0 1273482888
2751 lace the freed entry
2752 \change_deleted 0 1273492415
2754 \change_inserted 0 1273492415
2755 in the list for that zone
2760 Allocation is a little more complicated, as we
2761 \change_deleted 0 1273483240
2762 merge entries as we walk the list:
2763 \change_inserted 0 1273484250
2764 perform delayed coalescing at this point:
2770 \change_deleted 0 1273482955
2772 \change_inserted 0 1273482957
2776 \change_deleted 0 1273482962
2778 \change_inserted 0 1273482962
2782 \change_deleted 0 1273482966
2784 \change_inserted 0 1273482966
2791 \change_inserted 0 1273482980
2793 \change_deleted 0 1273482973
2797 \change_inserted 0 1273482982
2801 \change_inserted 0 1273483084
2807 \begin_layout Enumerate
2809 \change_deleted 0 1273492155
2811 \change_inserted 0 1273492159
2814 remove it from the list and return it.
2817 \begin_layout Enumerate
2819 \change_inserted 0 1273492206
2820 coalesce entries in the list.
2821 \change_deleted 0 1273492200
2822 examine the entry to the right of it in the file.
2827 \begin_layout Enumerate
2829 \change_deleted 0 1273492200
2830 If that entry is in a different list, lock that list too.
2833 \begin_layout Enumerate
2835 \change_deleted 0 1273492200
2836 If we had to place a new lock, re-check that the entry is free.
2839 \begin_layout Enumerate
2841 \change_deleted 0 1273492200
2842 Remove that entry from its free list and expand this entry to cover it.
2845 \begin_layout Enumerate
2847 \change_deleted 0 1273485554
2852 \begin_layout Enumerate
2854 \change_inserted 0 1273485311
2855 If there was no entry large enough, unlock the list and try the next zone.
2859 \change_deleted 0 1273483646
2860 Repeat step 3 with each entry in the list.
2866 \change_deleted 0 1273483668
2867 Unlock the list and repeat step 2 with the next list.
2873 \change_deleted 0 1273483671
2875 \change_inserted 0 1273483671
2878 satisfies, expand the file.
2881 This optimizes rapid insert/delete of free list entries
2882 \change_inserted 0 1273485794
2883 by not coalescing them all the time.
2884 \change_deleted 0 1273483685
2885 , and allows us to get rid of the tailer altogether
2889 \change_inserted 0 1273492299
2892 \change_deleted 0 1273476840
2894 \begin_inset Quotes eld
2898 \begin_inset Quotes erd
2901 free entries is more difficult: the 25% overhead works in practice for
2902 ldb because indexes tend to expand by one record at a time.
2903 This can be resolved by having an
2904 \begin_inset Quotes eld
2908 \begin_inset Quotes erd
2911 bit in the header to note entries that have previously expanded, and allocating
2912 more space for them.
2914 \begin_inset Quotes eld
2918 \begin_inset Quotes erd
2921 algorithm should be implemented or first-fit used is still unknown: we
2922 will determine this once these other ideas are implemented.
2923 \change_inserted 0 1273483750
2927 \begin_layout Standard
2929 \change_inserted 0 1273492450
2932 \change_inserted 0 1273470441
2935 \change_inserted 0 1273476556
2938 \change_inserted 0 1273470423
2944 \change_inserted 0 1273476847
2947 \change_inserted 0 1273476886
2950 \change_inserted 0 1273477233
2953 \change_inserted 0 1273477534
2956 \change_inserted 0 1273482700
2959 \change_inserted 0 1273478079
2962 \change_inserted 0 1273477839
2965 \change_inserted 0 1273477925
2968 \change_inserted 0 1273477925
2971 \change_inserted 0 1273477925
2974 \change_inserted 0 1273477925
2977 \change_inserted 0 1273477925
2980 \change_inserted 0 1273477925
2983 \change_inserted 0 1273477925
2986 \change_inserted 0 1273477925
2989 \change_inserted 0 1273477925
2992 \change_inserted 0 1273477925
2995 \change_inserted 0 1273477925
2998 \change_inserted 0 1273477925
3001 \change_inserted 0 1273477925
3004 \change_inserted 0 1273477925
3007 \change_inserted 0 1273477925
3010 \change_inserted 0 1273477925
3013 \change_inserted 0 1273477925
3016 \change_inserted 0 1273477925
3019 \change_inserted 0 1273492522
3022 \change_inserted 0 1273492530
3025 \change_inserted 0 1273492546
3028 \change_inserted 0 1273478239
3031 \change_inserted 0 1273479960
3034 \change_inserted 0 1273480265
3037 \change_inserted 0 1273480354
3040 \change_inserted 0 1273478968
3043 \change_inserted 0 1273492604
3046 \change_inserted 0 1273479572
3052 \change_inserted 0 1273480282
3055 \change_inserted 0 1273478931
3058 \change_inserted 0 1273481549
3061 \change_inserted 0 1273481557
3064 \change_inserted 0 1273480307
3067 \change_inserted 0 1273480335
3070 \change_inserted 0 1273479897
3073 \change_inserted 0 1273479653
3076 \change_inserted 0 1273480371
3079 \change_inserted 0 1273480464
3082 \change_inserted 0 1273480399
3085 \change_inserted 0 1273480425
3088 \change_inserted 0 1273480453
3091 \change_inserted 0 1273480455
3094 \change_inserted 0 1273480450
3097 \change_inserted 0 1273480452
3099 \change_inserted 0 1273478830
3103 \change_deleted 0 1273481604
3104 In theory, we could get away with 2: one after we write the new data, and
3105 one to somehow atomically change over to it.
3106 \change_inserted 0 1273481632
3109 \change_inserted 0 1273481724
3112 \change_inserted 0 1273481713
3115 \change_inserted 0 1273481717
3118 \change_inserted 0 1273481730
3121 \change_inserted 0 1273481736
3124 \change_inserted 0 1273481744
3127 \change_inserted 0 1273481748
3130 \change_inserted 0 1273482185
3133 \change_inserted 0 1273482259
3136 \change_deleted 0 1273481848
3138 Trying to rewrite the transaction code is a separate experiment, which
3139 I encourage someone else to do.
3140 At some point you say
3141 \begin_inset Quotes eld
3145 \begin_inset Quotes erd
3151 \begin_layout Standard
3153 \change_deleted 0 1273481848
3154 But as a thought experiment:
3159 \begin_layout Standard
3161 \change_deleted 0 1273481788
3162 Say there was a pointer in the header which said where the hash table and
3163 free list tables were, and that no blocks were labeled with whether they
3164 were free or not (it had to be derived from what list they were in).
3165 We could create new hash table and free list in some free space, and populate
3166 it as we want the post-committed state to look.
3167 Then we sync, then we switch the offset in the header, then we sync again.
3170 \begin_layout Standard
3172 \change_deleted 0 1273481788
3173 This would not allow arbitrary changes to the database, such as tdb_repack
3174 does, and would require more space (since we have to preserve the current
3175 and future entries at once).
3176 If we used hash trees rather than one big hash table, we might only have
3177 to rewrite some sections of the hash, too.
3178 \change_inserted 0 1273481854
3182 \begin_layout Standard
3184 \change_inserted 0 1273482102
3187 \change_inserted 0 1273482061
3190 \change_inserted 0 1273482063
3193 \change_inserted 0 1273482072
3196 \change_inserted 0 1273482139
3199 \change_inserted 0 1273482364
3202 \change_inserted 0 1273482163
3205 \change_inserted 0 1273482493
3208 \change_inserted 0 1273482536
3214 \change_inserted 0 1273482641
3217 \change_inserted 0 1273481827
3221 \change_inserted 0 1273481829
3224 implement snapshots using a similar method
3225 \change_deleted 0 1273481838
3227 \change_inserted 0 1273481840
3230 using multiple different hash tables/free tables.
3236 @After first feedback (Ronnie & Volker)
3242 The free list should be split into multiple lists to reduce contention.
3247 The algorithm for freeing is simple:
3250 Identify the correct free list.
3253 Lock the list, and place the freed entry at the head.
3256 Allocation is a little more complicated, as we merge entries as we walk
3260 Pick a free list; either the list we last freed onto, or based on a
3266 If the top entry is well-sized, remove it from the list and return it.
3269 Otherwise, examine the entry to the right of it in the file.
3280 If no list satisfies, expand the file.
3283 This optimizes rapid insert/delete of free list entries, and allows us to
3284 get rid of the tailer altogether.
3288 \change_inserted 0 1272941474
3291 \change_inserted 0 1272942759
3292 There are various ways to organize these lists, but because we want to be
3293 able to quickly identify which free list an entry is in, and reduce the
3294 number of locks required for merging, we will use zoning (eg.
3295 each of the N free lists in a tdb file of size M covers a fixed fraction
3297 Note that this means we need to reshuffle the free lists when we expand
3298 the file; this is probably acceptable when we double the hash table size,
3299 since that is such an expensive operation already.
3300 In the case of increasing the file size, there is an optimization we can
3301 use: if we use M in the formula above as the file size rounded up to the
3302 next power of 2, we only need reshuffle free lists when the file size crosses
3303 a power of 2 boundary,
3307 reshuffling the free lists is trivial: we simply merge every consecutive
3321 We could implement snapshots using a similar method to the above, only using
3322 multiple different hash tables/free tables.
3333 #LyX 1.6.4 created this file. For more info see http://www.lyx.org/
3336 \tracking_changes false
3337 \output_changes false
3341 behavior of disallowing transactions should become the default.
3346 The algorithm for freeing is simple: