9 date 2010.12.01.12.22.08; author rusty; state Exp;
14 date 2010.12.01.12.20.49; author rusty; state Exp;
19 date 2010.12.01.11.55.20; author rusty; state Exp;
24 date 2010.09.14.00.33.57; author rusty; state Exp;
29 date 2010.09.09.07.25.12; author rusty; state Exp;
34 date 2010.09.02.02.29.05; author rusty; state Exp;
39 date 2010.09.01.10.58.12; author rusty; state Exp;
44 date 2010.08.02.00.21.43; author rusty; state Exp;
49 date 2010.08.02.00.21.16; author rusty; state Exp;
54 date 2010.05.10.13.09.11; author rusty; state Exp;
59 date 2010.05.10.11.58.37; author rusty; state Exp;
64 date 2010.05.10.05.35.13; author rusty; state Exp;
69 date 2010.05.04.02.29.16; author rusty; state Exp;
84 @#LyX 1.6.7 created this file. For more info see http://www.lyx.org/
89 \use_default_options true
94 \font_typewriter default
95 \font_default_family default
102 \paperfontsize default
110 \paperorientation portrait
113 \paragraph_separation indent
115 \quotes_language english
118 \paperpagestyle default
119 \tracking_changes true
121 \author "Rusty Russell,,,"
128 TDB2: A Redesigning The Trivial DataBase
132 Rusty Russell, IBM Corporation
139 \begin_layout Abstract
140 The Trivial DataBase on-disk format is 32 bits; with usage cases heading
141 towards the 4G limit, that must change.
142 This required breakage provides an opportunity to revisit TDB's other design
143 decisions and reassess them.
146 \begin_layout Section
150 \begin_layout Standard
151 The Trivial DataBase was originally written by Andrew Tridgell as a simple
152 key/data pair storage system with the same API as dbm, but allowing multiple
153 readers and writers while being small enough (< 1000 lines of C) to include
155 The simple design created in 1999 has proven surprisingly robust and performant
156 , used in Samba versions 3 and 4 as well as numerous other projects.
157 Its useful life was greatly increased by the (backwards-compatible!) addition
158 of transaction support in 2005.
161 \begin_layout Standard
162 The wider variety and greater demands of TDB-using code has lead to some
163 organic growth of the API, as well as some compromises on the implementation.
164 None of these, by themselves, are seen as show-stoppers, but the cumulative
165 effect is to a loss of elegance over the initial, simple TDB implementation.
166 Here is a table of the approximate number of lines of implementation code
167 and number of API functions at the end of each year:
170 \begin_layout Standard
172 <lyxtabular version="3" rows="12" columns="3">
174 <column alignment="center" valignment="top" width="0">
175 <column alignment="center" valignment="top" width="0">
176 <column alignment="center" valignment="top" width="0">
178 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
181 \begin_layout Plain Layout
187 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
190 \begin_layout Plain Layout
196 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
199 \begin_layout Plain Layout
200 Lines of C Code Implementation
207 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
210 \begin_layout Plain Layout
216 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
219 \begin_layout Plain Layout
225 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
228 \begin_layout Plain Layout
236 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
239 \begin_layout Plain Layout
245 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
248 \begin_layout Plain Layout
254 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
257 \begin_layout Plain Layout
265 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
268 \begin_layout Plain Layout
274 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
277 \begin_layout Plain Layout
283 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
286 \begin_layout Plain Layout
294 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
297 \begin_layout Plain Layout
303 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
306 \begin_layout Plain Layout
312 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
315 \begin_layout Plain Layout
323 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
326 \begin_layout Plain Layout
332 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
335 \begin_layout Plain Layout
341 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
344 \begin_layout Plain Layout
352 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
355 \begin_layout Plain Layout
361 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
364 \begin_layout Plain Layout
370 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
373 \begin_layout Plain Layout
381 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
384 \begin_layout Plain Layout
390 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
393 \begin_layout Plain Layout
399 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
402 \begin_layout Plain Layout
410 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
413 \begin_layout Plain Layout
419 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
422 \begin_layout Plain Layout
428 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
431 \begin_layout Plain Layout
439 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
442 \begin_layout Plain Layout
448 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
451 \begin_layout Plain Layout
457 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
460 \begin_layout Plain Layout
468 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
471 \begin_layout Plain Layout
477 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
480 \begin_layout Plain Layout
486 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
489 \begin_layout Plain Layout
497 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
500 \begin_layout Plain Layout
506 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
509 \begin_layout Plain Layout
515 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
518 \begin_layout Plain Layout
532 \begin_layout Standard
533 This review is an attempt to catalog and address all the known issues with
534 TDB and create solutions which address the problems without significantly
535 increasing complexity; all involved are far too aware of the dangers of
536 second system syndrome in rewriting a successful project like this.
539 \begin_layout Section
543 \begin_layout Subsection
544 tdb_open_ex Is Not Expandable
547 \begin_layout Standard
548 The tdb_open() call was expanded to tdb_open_ex(), which added an optional
549 hashing function and an optional logging function argument.
550 Additional arguments to open would require the introduction of a tdb_open_ex2
554 \begin_layout Subsubsection
556 \begin_inset CommandInset label
565 \begin_layout Standard
566 tdb_open() will take a linked-list of attributes:
569 \begin_layout LyX-Code
573 \begin_layout LyX-Code
574 TDB_ATTRIBUTE_LOG = 0,
577 \begin_layout LyX-Code
578 TDB_ATTRIBUTE_HASH = 1
581 \begin_layout LyX-Code
585 \begin_layout LyX-Code
586 struct tdb_attribute_base {
589 \begin_layout LyX-Code
590 enum tdb_attribute attr;
593 \begin_layout LyX-Code
594 union tdb_attribute *next;
597 \begin_layout LyX-Code
601 \begin_layout LyX-Code
602 struct tdb_attribute_log {
605 \begin_layout LyX-Code
606 struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_LOG */
609 \begin_layout LyX-Code
613 \begin_layout LyX-Code
617 \begin_layout LyX-Code
621 \begin_layout LyX-Code
622 struct tdb_attribute_hash {
625 \begin_layout LyX-Code
626 struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_HASH */
629 \begin_layout LyX-Code
630 tdb_hash_func hash_fn;
633 \begin_layout LyX-Code
637 \begin_layout LyX-Code
641 \begin_layout LyX-Code
642 union tdb_attribute {
645 \begin_layout LyX-Code
646 struct tdb_attribute_base base;
649 \begin_layout LyX-Code
650 struct tdb_attribute_log log;
653 \begin_layout LyX-Code
654 struct tdb_attribute_hash hash;
657 \begin_layout LyX-Code
661 \begin_layout Standard
662 This allows future attributes to be added, even if this expands the size
666 \begin_layout Subsubsection
670 \begin_layout Standard
674 \begin_layout Subsection
675 tdb_traverse Makes Impossible Guarantees
678 \begin_layout Standard
679 tdb_traverse (and tdb_firstkey/tdb_nextkey) predate transactions, and it
680 was thought that it was important to guarantee that all records which exist
681 at the start and end of the traversal would be included, and no record
682 would be included twice.
685 \begin_layout Standard
686 This adds complexity (see
687 \begin_inset CommandInset ref
689 reference "Reliable-Traversal-Adds"
693 ) and does not work anyway for records which are altered (in particular,
694 those which are expanded may be effectively deleted and re-added behind
698 \begin_layout Subsubsection
699 \begin_inset CommandInset label
701 name "traverse-Proposed-Solution"
708 \begin_layout Standard
709 Abandon the guarantee.
710 You will see every record if no changes occur during your traversal, otherwise
711 you will see some subset.
712 You can prevent changes by using a transaction or the locking API.
715 \begin_layout Subsubsection
719 \begin_layout Standard
721 Delete-during-traverse will still delete every record, too (assuming no
725 \begin_layout Subsection
726 Nesting of Transactions Is Fraught
729 \begin_layout Standard
730 TDB has alternated between allowing nested transactions and not allowing
732 Various paths in the Samba codebase assume that transactions will nest,
733 and in a sense they can: the operation is only committed to disk when the
734 outer transaction is committed.
735 There are two problems, however:
738 \begin_layout Enumerate
739 Canceling the inner transaction will cause the outer transaction commit
740 to fail, and will not undo any operations since the inner transaction began.
741 This problem is soluble with some additional internal code.
744 \begin_layout Enumerate
745 An inner transaction commit can be cancelled by the outer transaction.
746 This is desirable in the way which Samba's database initialization code
747 uses transactions, but could be a surprise to any users expecting a successful
748 transaction commit to expose changes to others.
751 \begin_layout Standard
752 The current solution is to specify the behavior at tdb_open(), with the
753 default currently that nested transactions are allowed.
754 This flag can also be changed at runtime.
757 \begin_layout Subsubsection
761 \begin_layout Standard
762 Given the usage patterns, it seems that the
763 \begin_inset Quotes eld
767 \begin_inset Quotes erd
770 behavior of disallowing nested transactions should become the default.
771 Additionally, it seems the outer transaction is the only code which knows
772 whether inner transactions should be allowed, so a flag to indicate this
773 could be added to tdb_transaction_start.
774 However, this behavior can be simulated with a wrapper which uses tdb_add_flags
775 () and tdb_remove_flags(), so the API should not be expanded for this relatively
779 \begin_layout Subsubsection
783 \begin_layout Standard
784 Incomplete; nesting flag is still defined as per tdb1.
787 \begin_layout Subsection
788 Incorrect Hash Function is Not Detected
791 \begin_layout Standard
792 tdb_open_ex() allows the calling code to specify a different hash function
793 to use, but does not check that all other processes accessing this tdb
794 are using the same hash function.
795 The result is that records are missing from tdb_fetch().
798 \begin_layout Subsubsection
802 \begin_layout Standard
803 The header should contain an example hash result (eg.
804 the hash of 0xdeadbeef), and tdb_open_ex() should check that the given
805 hash function produces the same answer, or fail the tdb_open call.
808 \begin_layout Subsubsection
812 \begin_layout Standard
816 \begin_layout Subsection
817 tdb_set_max_dead/TDB_VOLATILE Expose Implementation
820 \begin_layout Standard
821 In response to scalability issues with the free list (
822 \begin_inset CommandInset ref
824 reference "TDB-Freelist-Is"
828 ) two API workarounds have been incorporated in TDB: tdb_set_max_dead()
829 and the TDB_VOLATILE flag to tdb_open.
830 The latter actually calls the former with an argument of
831 \begin_inset Quotes eld
835 \begin_inset Quotes erd
841 \begin_layout Standard
842 This code allows deleted records to accumulate without putting them in the
844 On delete we iterate through each chain and free them in a batch if there
845 are more than max_dead entries.
846 These are never otherwise recycled except as a side-effect of a tdb_repack.
849 \begin_layout Subsubsection
853 \begin_layout Standard
854 With the scalability problems of the freelist solved, this API can be removed.
855 The TDB_VOLATILE flag may still be useful as a hint that store and delete
856 of records will be at least as common as fetch in order to allow some internal
857 tuning, but initially will become a no-op.
860 \begin_layout Subsubsection
864 \begin_layout Standard
866 TDB_VOLATILE still defined, but implementation should fail on unknown flags
870 \begin_layout Subsection
871 \begin_inset CommandInset label
873 name "TDB-Files-Cannot"
877 TDB Files Cannot Be Opened Multiple Times In The Same Process
880 \begin_layout Standard
881 No process can open the same TDB twice; we check and disallow it.
882 This is an unfortunate side-effect of fcntl locks, which operate on a per-file
883 rather than per-file-descriptor basis, and do not nest.
884 Thus, closing any file descriptor on a file clears all the locks obtained
885 by this process, even if they were placed using a different file descriptor!
888 \begin_layout Standard
889 Note that even if this were solved, deadlock could occur if operations were
890 nested: this is a more manageable programming error in most cases.
893 \begin_layout Subsubsection
897 \begin_layout Standard
898 We could lobby POSIX to fix the perverse rules, or at least lobby Linux
899 to violate them so that the most common implementation does not have this
901 This would be a generally good idea for other fcntl lock users.
904 \begin_layout Standard
905 Samba uses a wrapper which hands out the same tdb_context to multiple callers
906 if this happens, and does simple reference counting.
907 We should do this inside the tdb library, which already emulates lock nesting
908 internally; it would need to recognize when deadlock occurs within a single
910 This would create a new failure mode for tdb operations (while we currently
911 handle locking failures, they are impossible in normal use and a process
912 encountering them can do little but give up).
915 \begin_layout Standard
916 I do not see benefit in an additional tdb_open flag to indicate whether
917 re-opening is allowed, as though there may be some benefit to adding a
918 call to detect when a tdb_context is shared, to allow other to create such
922 \begin_layout Subsubsection
926 \begin_layout Standard
930 \begin_layout Subsection
931 TDB API Is Not POSIX Thread-safe
934 \begin_layout Standard
935 The TDB API uses an error code which can be queried after an operation to
936 determine what went wrong.
937 This programming model does not work with threads, unless specific additional
938 guarantees are given by the implementation.
939 In addition, even otherwise-independent threads cannot open the same TDB
941 \begin_inset CommandInset ref
943 reference "TDB-Files-Cannot"
950 \begin_layout Subsubsection
954 \begin_layout Standard
955 Reachitecting the API to include a tdb_errcode pointer would be a great
956 deal of churn; we are better to guarantee that the tdb_errcode is per-thread
957 so the current programming model can be maintained.
960 \begin_layout Standard
961 This requires dynamic per-thread allocations, which is awkward with POSIX
962 threads (pthread_key_create space is limited and we cannot simply allocate
963 a key for every TDB).
966 \begin_layout Standard
967 Internal locking is required to make sure that fcntl locks do not overlap
968 between threads, and also that the global list of tdbs is maintained.
971 \begin_layout Standard
972 The aim is that building tdb with -DTDB_PTHREAD will result in a pthread-safe
973 version of the library, and otherwise no overhead will exist.
974 Alternatively, a hooking mechanism similar to that proposed for
975 \begin_inset CommandInset ref
977 reference "Proposed-Solution-locking-hook"
981 could be used to enable pthread locking at runtime.
984 \begin_layout Subsubsection
988 \begin_layout Standard
992 \begin_layout Subsection
993 *_nonblock Functions And *_mark Functions Expose Implementation
996 \begin_layout Standard
1001 \begin_layout Plain Layout
1002 Clustered TDB, see http://ctdb.samba.org
1007 wishes to operate on TDB in a non-blocking manner.
1008 This is currently done as follows:
1011 \begin_layout Enumerate
1012 Call the _nonblock variant of an API function (eg.
1013 tdb_lockall_nonblock).
1017 \begin_layout Enumerate
1018 Fork a child process, and wait for it to call the normal variant (eg.
1022 \begin_layout Enumerate
1023 If the child succeeds, call the _mark variant to indicate we already have
1028 \begin_layout Enumerate
1029 Upon completion, tell the child to release the locks (eg.
1033 \begin_layout Enumerate
1034 Indicate to tdb that it should consider the locks removed (eg.
1035 tdb_unlockall_mark).
1038 \begin_layout Standard
1039 There are several issues with this approach.
1040 Firstly, adding two new variants of each function clutters the API for
1041 an obscure use, and so not all functions have three variants.
1042 Secondly, it assumes that all paths of the functions ask for the same locks,
1043 otherwise the parent process will have to get a lock which the child doesn't
1044 have under some circumstances.
1045 I don't believe this is currently the case, but it constrains the implementatio
1050 \begin_layout Subsubsection
1051 \begin_inset CommandInset label
1053 name "Proposed-Solution-locking-hook"
1060 \begin_layout Standard
1061 Implement a hook for locking methods, so that the caller can control the
1062 calls to create and remove fcntl locks.
1063 In this scenario, ctdbd would operate as follows:
1066 \begin_layout Enumerate
1067 Call the normal API function, eg tdb_lockall().
1070 \begin_layout Enumerate
1071 When the lock callback comes in, check if the child has the lock.
1072 Initially, this is always false.
1074 Otherwise, try to obtain it in non-blocking mode.
1075 If that fails, return EWOULDBLOCK.
1078 \begin_layout Enumerate
1079 Release locks in the unlock callback as normal.
1082 \begin_layout Enumerate
1083 If tdb_lockall() fails, see if we recorded a lock failure; if so, call the
1084 child to repeat the operation.
1087 \begin_layout Enumerate
1088 The child records what locks it obtains, and returns that information to
1092 \begin_layout Enumerate
1093 When the child has succeeded, goto 1.
1096 \begin_layout Standard
1097 This is flexible enough to handle any potential locking scenario, even when
1098 lock requirements change.
1099 It can be optimized so that the parent does not release locks, just tells
1100 the child which locks it doesn't need to obtain.
1103 \begin_layout Standard
1104 It also keeps the complexity out of the API, and in ctdbd where it is needed.
1107 \begin_layout Subsubsection
1111 \begin_layout Standard
1115 \begin_layout Subsection
1116 tdb_chainlock Functions Expose Implementation
1119 \begin_layout Standard
1120 tdb_chainlock locks some number of records, including the record indicated
1122 This gave atomicity guarantees; no-one can start a transaction, alter,
1123 read or delete that key while the lock is held.
1126 \begin_layout Standard
1127 It also makes the same guarantee for any other key in the chain, which is
1128 an internal implementation detail and potentially a cause for deadlock.
1131 \begin_layout Subsubsection
1135 \begin_layout Standard
1137 It would be nice to have an explicit single entry lock which effected no
1139 Unfortunately, this won't work for an entry which doesn't exist.
1140 Thus while chainlock may be implemented more efficiently for the existing
1141 case, it will still have overlap issues with the non-existing case.
1142 So it is best to keep the current (lack of) guarantee about which records
1143 will be effected to avoid constraining our implementation.
1146 \begin_layout Subsection
1147 Signal Handling is Not Race-Free
1150 \begin_layout Standard
1151 The tdb_setalarm_sigptr() call allows the caller's signal handler to indicate
1152 that the tdb locking code should return with a failure, rather than trying
1153 again when a signal is received (and errno == EAGAIN).
1154 This is usually used to implement timeouts.
1157 \begin_layout Standard
1158 Unfortunately, this does not work in the case where the signal is received
1159 before the tdb code enters the fcntl() call to place the lock: the code
1160 will sleep within the fcntl() code, unaware that the signal wants it to
1162 In the case of long timeouts, this does not happen in practice.
1165 \begin_layout Subsubsection
1169 \begin_layout Standard
1170 The locking hooks proposed in
1171 \begin_inset CommandInset ref
1173 reference "Proposed-Solution-locking-hook"
1177 would allow the user to decide on whether to fail the lock acquisition
1179 This allows the caller to choose their own compromise: they could narrow
1180 the race by checking immediately before the fcntl call.
1184 \begin_layout Plain Layout
1185 It may be possible to make this race-free in some implementations by having
1186 the signal handler alter the struct flock to make it invalid.
1187 This will cause the fcntl() lock call to fail with EINVAL if the signal
1188 occurs before the kernel is entered, otherwise EAGAIN.
1196 \begin_layout Subsubsection
1200 \begin_layout Standard
1204 \begin_layout Subsection
1205 The API Uses Gratuitous Typedefs, Capitals
1208 \begin_layout Standard
1209 typedefs are useful for providing source compatibility when types can differ
1210 across implementations, or arguably in the case of function pointer definitions
1211 which are hard for humans to parse.
1212 Otherwise it is simply obfuscation and pollutes the namespace.
1215 \begin_layout Standard
1216 Capitalization is usually reserved for compile-time constants and macros.
1219 \begin_layout Description
1220 TDB_CONTEXT There is no reason to use this over 'struct tdb_context'; the
1221 definition isn't visible to the API user anyway.
1224 \begin_layout Description
1225 TDB_DATA There is no reason to use this over struct TDB_DATA; the struct
1226 needs to be understood by the API user.
1229 \begin_layout Description
1231 \begin_inset space ~
1234 TDB_DATA This would normally be called 'struct tdb_data'.
1237 \begin_layout Description
1239 \begin_inset space ~
1242 TDB_ERROR Similarly, this would normally be enum tdb_error.
1245 \begin_layout Subsubsection
1249 \begin_layout Standard
1251 Introducing lower case variants would please pedants like myself, but if
1252 it were done the existing ones should be kept.
1253 There is little point forcing a purely cosmetic change upon tdb users.
1256 \begin_layout Subsection
1257 \begin_inset CommandInset label
1259 name "tdb_log_func-Doesnt-Take"
1263 tdb_log_func Doesn't Take The Private Pointer
1266 \begin_layout Standard
1267 For API compatibility reasons, the logging function needs to call tdb_get_loggin
1268 g_private() to retrieve the pointer registered by the tdb_open_ex for logging.
1271 \begin_layout Subsubsection
1275 \begin_layout Standard
1276 It should simply take an extra argument, since we are prepared to break
1280 \begin_layout Subsubsection
1284 \begin_layout Standard
1288 \begin_layout Subsection
1289 Various Callback Functions Are Not Typesafe
1292 \begin_layout Standard
1293 The callback functions in tdb_set_logging_function (after
1294 \begin_inset CommandInset ref
1296 reference "tdb_log_func-Doesnt-Take"
1300 is resolved), tdb_parse_record, tdb_traverse, tdb_traverse_read and tdb_check
1301 all take void * and must internally convert it to the argument type they
1305 \begin_layout Standard
1306 If this type changes, the compiler will not produce warnings on the callers,
1307 since it only sees void *.
1310 \begin_layout Subsubsection
1314 \begin_layout Standard
1315 With careful use of macros, we can create callback functions which give
1316 a warning when used on gcc and the types of the callback and its private
1318 Unsupported compilers will not give a warning, which is no worse than now.
1319 In addition, the callbacks become clearer, as they need not use void *
1320 for their parameter.
1323 \begin_layout Standard
1324 See CCAN's typesafe_cb module at http://ccan.ozlabs.org/info/typesafe_cb.html
1327 \begin_layout Subsubsection
1331 \begin_layout Standard
1335 \begin_layout Subsection
1336 TDB_CLEAR_IF_FIRST Must Be Specified On All Opens, tdb_reopen_all Problematic
1339 \begin_layout Standard
1340 The TDB_CLEAR_IF_FIRST flag to tdb_open indicates that the TDB file should
1341 be cleared if the caller discovers it is the only process with the TDB
1343 However, if any caller does not specify TDB_CLEAR_IF_FIRST it will not
1344 be detected, so will have the TDB erased underneath them (usually resulting
1348 \begin_layout Standard
1349 There is a similar issue on fork(); if the parent exits (or otherwise closes
1350 the tdb) before the child calls tdb_reopen_all() to establish the lock
1351 used to indicate the TDB is opened by someone, a TDB_CLEAR_IF_FIRST opener
1352 at that moment will believe it alone has opened the TDB and will erase
1356 \begin_layout Subsubsection
1360 \begin_layout Standard
1361 Remove TDB_CLEAR_IF_FIRST.
1362 Other workarounds are possible, but see
1363 \begin_inset CommandInset ref
1365 reference "TDB_CLEAR_IF_FIRST-Imposes-Performance"
1372 \begin_layout Subsubsection
1376 \begin_layout Standard
1377 Incomplete, TDB_CLEAR_IF_FIRST still defined, but not implemented.
1380 \begin_layout Subsection
1381 Extending The Header Is Difficult
1384 \begin_layout Standard
1385 We have reserved (zeroed) words in the TDB header, which can be used for
1387 If the future features are compulsory, the version number must be updated
1388 to prevent old code from accessing the database.
1389 But if the future feature is optional, we have no way of telling if older
1390 code is accessing the database or not.
1393 \begin_layout Subsubsection
1397 \begin_layout Standard
1398 The header should contain a
1399 \begin_inset Quotes eld
1403 \begin_inset Quotes erd
1407 This is divided into two 32-bit parts:
1410 \begin_layout Enumerate
1411 The lower part reflects the format variant understood by code accessing
1415 \begin_layout Enumerate
1416 The upper part reflects the format variant you must understand to write
1417 to the database (otherwise you can only open for reading).
1420 \begin_layout Standard
1421 The latter field can only be written at creation time, the former should
1422 be written under the OPEN_LOCK when opening the database for writing, if
1423 the variant of the code is lower than the current lowest variant.
1426 \begin_layout Standard
1427 This should allow backwards-compatible features to be added, and detection
1428 if older code (which doesn't understand the feature) writes to the database.
1431 \begin_layout Subsubsection
1435 \begin_layout Standard
1439 \begin_layout Subsection
1440 Record Headers Are Not Expandible
1443 \begin_layout Standard
1444 If we later want to add (say) checksums on keys and data, it would require
1445 another format change, which we'd like to avoid.
1448 \begin_layout Subsubsection
1452 \begin_layout Standard
1453 We often have extra padding at the tail of a record.
1454 If we ensure that the first byte (if any) of this padding is zero, we will
1455 have a way for future changes to detect code which doesn't understand a
1456 new format: the new code would write (say) a 1 at the tail, and thus if
1457 there is no tail or the first byte is 0, we would know the extension is
1458 not present on that record.
1461 \begin_layout Subsubsection
1465 \begin_layout Standard
1469 \begin_layout Subsection
1470 TDB Does Not Use Talloc
1473 \begin_layout Standard
1474 Many users of TDB (particularly Samba) use the talloc allocator, and thus
1475 have to wrap TDB in a talloc context to use it conveniently.
1478 \begin_layout Subsubsection
1482 \begin_layout Standard
1483 The allocation within TDB is not complicated enough to justify the use of
1484 talloc, and I am reluctant to force another (excellent) library on TDB
1486 Nonetheless a compromise is possible.
1488 \begin_inset CommandInset ref
1490 reference "attributes"
1494 ) can be added later to tdb_open() to provide an alternate allocation mechanism,
1495 specifically for talloc but usable by any other allocator (which would
1497 \begin_inset Quotes eld
1501 \begin_inset Quotes erd
1507 \begin_layout Standard
1508 This would form a talloc heirarchy as expected, but the caller would still
1509 have to attach a destructor to the tdb context returned from tdb_open to
1511 All TDB_DATA fields would be children of the tdb_context, and the caller
1512 would still have to manage them (using talloc_free() or talloc_steal()).
1515 \begin_layout Subsubsection
1519 \begin_layout Standard
1523 \begin_layout Section
1524 Performance And Scalability Issues
1527 \begin_layout Subsection
1528 \begin_inset CommandInset label
1530 name "TDB_CLEAR_IF_FIRST-Imposes-Performance"
1534 TDB_CLEAR_IF_FIRST Imposes Performance Penalty
1537 \begin_layout Standard
1538 When TDB_CLEAR_IF_FIRST is specified, a 1-byte read lock is placed at offset
1541 While these locks never conflict in normal tdb usage, they do add substantial
1542 overhead for most fcntl lock implementations when the kernel scans to detect
1543 if a lock conflict exists.
1544 This is often a single linked list, making the time to acquire and release
1545 a fcntl lock O(N) where N is the number of processes with the TDB open,
1546 not the number actually doing work.
1549 \begin_layout Standard
1550 In a Samba server it is common to have huge numbers of clients sitting idle,
1551 and thus they have weaned themselves off the TDB_CLEAR_IF_FIRST flag.
1555 \begin_layout Plain Layout
1556 There is a flag to tdb_reopen_all() which is used for this optimization:
1557 if the parent process will outlive the child, the child does not need the
1559 This is a workaround for this very performance issue.
1567 \begin_layout Subsubsection
1571 \begin_layout Standard
1573 It was a neat idea, but even trivial servers tend to know when they are
1574 initializing for the first time and can simply unlink the old tdb at that
1578 \begin_layout Subsubsection
1582 \begin_layout Standard
1583 Incomplete; TDB_CLEAR_IF_FIRST still defined, but does nothing.
1586 \begin_layout Subsection
1587 TDB Files Have a 4G Limit
1590 \begin_layout Standard
1591 This seems to be becoming an issue (so much for
1592 \begin_inset Quotes eld
1596 \begin_inset Quotes erd
1599 !), particularly for ldb.
1602 \begin_layout Subsubsection
1606 \begin_layout Standard
1607 A new, incompatible TDB format which uses 64 bit offsets internally rather
1609 For simplicity of endian conversion (which TDB does on the fly if required),
1610 all values will be 64 bit on disk.
1611 In practice, some upper bits may be used for other purposes, but at least
1612 56 bits will be available for file offsets.
1615 \begin_layout Standard
1616 tdb_open() will automatically detect the old version, and even create them
1617 if TDB_VERSION6 is specified to tdb_open.
1620 \begin_layout Standard
1621 32 bit processes will still be able to access TDBs larger than 4G (assuming
1622 that their off_t allows them to seek to 64 bits), they will gracefully
1623 fall back as they fail to mmap.
1624 This can happen already with large TDBs.
1627 \begin_layout Standard
1628 Old versions of tdb will fail to open the new TDB files (since 28 August
1629 2009, commit 398d0c29290: prior to that any unrecognized file format would
1630 be erased and initialized as a fresh tdb!)
1633 \begin_layout Subsubsection
1637 \begin_layout Standard
1641 \begin_layout Subsection
1642 TDB Records Have a 4G Limit
1645 \begin_layout Standard
1646 This has not been a reported problem, and the API uses size_t which can
1647 be 64 bit on 64 bit platforms.
1648 However, other limits may have made such an issue moot.
1651 \begin_layout Subsubsection
1655 \begin_layout Standard
1656 Record sizes will be 64 bit, with an error returned on 32 bit platforms
1657 which try to access such records (the current implementation would return
1658 TDB_ERR_OOM in a similar case).
1659 It seems unlikely that 32 bit keys will be a limitation, so the implementation
1660 may not support this (see
1661 \begin_inset CommandInset ref
1663 reference "sub:Records-Incur-A"
1670 \begin_layout Subsubsection
1674 \begin_layout Standard
1678 \begin_layout Subsection
1679 Hash Size Is Determined At TDB Creation Time
1682 \begin_layout Standard
1683 TDB contains a number of hash chains in the header; the number is specified
1684 at creation time, and defaults to 131.
1685 This is such a bottleneck on large databases (as each hash chain gets quite
1686 long), that LDB uses 10,000 for this hash.
1687 In general it is impossible to know what the 'right' answer is at database
1691 \begin_layout Subsubsection
1692 \begin_inset CommandInset label
1694 name "sub:Hash-Size-Solution"
1701 \begin_layout Standard
1702 After comprehensive performance testing on various scalable hash variants
1706 \begin_layout Plain Layout
1707 http://rusty.ozlabs.org/?p=89 and http://rusty.ozlabs.org/?p=94 This was annoying
1708 because I was previously convinced that an expanding tree of hashes would
1709 be very close to optimal.
1714 , it became clear that it is hard to beat a straight linear hash table which
1715 doubles in size when it reaches saturation.
1716 Unfortunately, altering the hash table introduces serious locking complications
1717 : the entire hash table needs to be locked to enlarge the hash table, and
1718 others might be holding locks.
1719 Particularly insidious are insertions done under tdb_chainlock.
1722 \begin_layout Standard
1723 Thus an expanding layered hash will be used: an array of hash groups, with
1724 each hash group exploding into pointers to lower hash groups once it fills,
1725 turning into a hash tree.
1726 This has implications for locking: we must lock the entire group in case
1727 we need to expand it, yet we don't know how deep the tree is at that point.
1730 \begin_layout Standard
1731 Note that bits from the hash table entries should be stolen to hold more
1732 hash bits to reduce the penalty of collisions.
1733 We can use the otherwise-unused lower 3 bits.
1734 If we limit the size of the database to 64 exabytes, we can use the top
1735 8 bits of the hash entry as well.
1736 These 11 bits would reduce false positives down to 1 in 2000 which is more
1737 than we need: we can use one of the bits to indicate that the extra hash
1739 This means we can choose not to re-hash all entries when we expand a hash
1740 group; simply use the next bits we need and mark them invalid.
1743 \begin_layout Subsubsection
1747 \begin_layout Standard
1751 \begin_layout Subsection
1752 \begin_inset CommandInset label
1754 name "TDB-Freelist-Is"
1758 TDB Freelist Is Highly Contended
1761 \begin_layout Standard
1762 TDB uses a single linked list for the free list.
1763 Allocation occurs as follows, using heuristics which have evolved over
1767 \begin_layout Enumerate
1768 Get the free list lock for this whole operation.
1771 \begin_layout Enumerate
1772 Multiply length by 1.25, so we always over-allocate by 25%.
1775 \begin_layout Enumerate
1776 Set the slack multiplier to 1.
1779 \begin_layout Enumerate
1780 Examine the current freelist entry: if it is > length but < the current
1781 best case, remember it as the best case.
1784 \begin_layout Enumerate
1785 Multiply the slack multiplier by 1.05.
1788 \begin_layout Enumerate
1789 If our best fit so far is less than length * slack multiplier, return it.
1790 The slack will be turned into a new free record if it's large enough.
1793 \begin_layout Enumerate
1794 Otherwise, go onto the next freelist entry.
1797 \begin_layout Standard
1798 Deleting a record occurs as follows:
1801 \begin_layout Enumerate
1802 Lock the hash chain for this whole operation.
1805 \begin_layout Enumerate
1806 Walk the chain to find the record, keeping the prev pointer offset.
1809 \begin_layout Enumerate
1810 If max_dead is non-zero:
1814 \begin_layout Enumerate
1815 Walk the hash chain again and count the dead records.
1818 \begin_layout Enumerate
1819 If it's more than max_dead, bulk free all the dead ones (similar to steps
1820 4 and below, but the lock is only obtained once).
1823 \begin_layout Enumerate
1824 Simply mark this record as dead and return.
1829 \begin_layout Enumerate
1830 Get the free list lock for the remainder of this operation.
1833 \begin_layout Enumerate
1834 \begin_inset CommandInset label
1836 name "right-merging"
1840 Examine the following block to see if it is free; if so, enlarge the current
1841 block and remove that block from the free list.
1842 This was disabled, as removal from the free list was O(entries-in-free-list).
1845 \begin_layout Enumerate
1846 Examine the preceeding block to see if it is free: for this reason, each
1847 block has a 32-bit tailer which indicates its length.
1848 If it is free, expand it to cover our new block and return.
1851 \begin_layout Enumerate
1852 Otherwise, prepend ourselves to the free list.
1855 \begin_layout Standard
1856 Disabling right-merging (step
1857 \begin_inset CommandInset ref
1859 reference "right-merging"
1863 ) causes fragmentation; the other heuristics proved insufficient to address
1864 this, so the final answer to this was that when we expand the TDB file
1865 inside a transaction commit, we repack the entire tdb.
1868 \begin_layout Standard
1869 The single list lock limits our allocation rate; due to the other issues
1870 this is not currently seen as a bottleneck.
1873 \begin_layout Subsubsection
1877 \begin_layout Standard
1878 The first step is to remove all the current heuristics, as they obviously
1879 interact, then examine them once the lock contention is addressed.
1882 \begin_layout Standard
1883 The free list must be split to reduce contention.
1884 Assuming perfect free merging, we can at most have 1 free list entry for
1886 This implies that the number of free lists is related to the size of the
1887 hash table, but as it is rare to walk a large number of free list entries
1888 we can use far fewer, say 1/32 of the number of hash buckets.
1891 \begin_layout Standard
1892 It seems tempting to try to reuse the hash implementation which we use for
1893 records here, but we have two ways of searching for free entries: for allocatio
1894 n we search by size (and possibly zone) which produces too many clashes
1895 for our hash table to handle well, and for coalescing we search by address.
1896 Thus an array of doubly-linked free lists seems preferable.
1899 \begin_layout Standard
1900 There are various benefits in using per-size free lists (see
1901 \begin_inset CommandInset ref
1903 reference "sub:TDB-Becomes-Fragmented"
1907 ) but it's not clear this would reduce contention in the common case where
1908 all processes are allocating/freeing the same size.
1909 Thus we almost certainly need to divide in other ways: the most obvious
1910 is to divide the file into zones, and using a free list (or table of free
1912 This approximates address ordering.
1915 \begin_layout Standard
1916 Unfortunately it is difficult to know what heuristics should be used to
1917 determine zone sizes, and our transaction code relies on being able to
1919 \begin_inset Quotes eld
1923 \begin_inset Quotes erd
1926 by simply appending to the file (difficult if it would need to create a
1928 Thus we use a linked-list of free tables; currently we only ever create
1929 one, but if there is more than one we choose one at random to use.
1930 In future we may use heuristics to add new free tables on contention.
1931 We only expand the file when all free tables are exhausted.
1934 \begin_layout Standard
1935 The basic algorithm is as follows.
1939 \begin_layout Enumerate
1940 Identify the correct free list.
1943 \begin_layout Enumerate
1944 Lock the corresponding list.
1947 \begin_layout Enumerate
1948 Re-check the list (we didn't have a lock, sizes could have changed): relock
1952 \begin_layout Enumerate
1953 Place the freed entry in the list.
1956 \begin_layout Standard
1957 Allocation is a little more complicated, as we perform delayed coalescing
1961 \begin_layout Enumerate
1962 Pick a free table; usually the previous one.
1965 \begin_layout Enumerate
1966 Lock the corresponding list.
1969 \begin_layout Enumerate
1970 If the top entry is -large enough, remove it from the list and return it.
1973 \begin_layout Enumerate
1974 Otherwise, coalesce entries in the list.If there was no entry large enough,
1975 unlock the list and try the next largest list
1978 \begin_layout Enumerate
1979 If no list has an entry which meets our needs, try the next free table.
1982 \begin_layout Enumerate
1983 If no zone satisfies, expand the file.
1986 \begin_layout Standard
1987 This optimizes rapid insert/delete of free list entries by not coalescing
1989 First-fit address ordering ordering seems to be fairly good for keeping
1990 fragmentation low (see
1991 \begin_inset CommandInset ref
1993 reference "sub:TDB-Becomes-Fragmented"
1998 Note that address ordering does not need a tailer to coalesce, though if
1999 we needed one we could have one cheaply: see
2000 \begin_inset CommandInset ref
2002 reference "sub:Records-Incur-A"
2010 \begin_layout Standard
2011 Each free entry has the free table number in the header: less than 255.
2012 It also contains a doubly-linked list for easy deletion.
2015 \begin_layout Subsection
2016 \begin_inset CommandInset label
2018 name "sub:TDB-Becomes-Fragmented"
2022 TDB Becomes Fragmented
2025 \begin_layout Standard
2026 Much of this is a result of allocation strategy
2030 \begin_layout Plain Layout
2031 The Memory Fragmentation Problem: Solved? Johnstone & Wilson 1995 ftp://ftp.cs.ute
2032 xas.edu/pub/garbage/malloc/ismm98.ps
2037 and deliberate hobbling of coalescing; internal fragmentation (aka overallocati
2038 on) is deliberately set at 25%, and external fragmentation is only cured
2039 by the decision to repack the entire db when a transaction commit needs
2040 to enlarge the file.
2043 \begin_layout Subsubsection
2047 \begin_layout Standard
2048 The 25% overhead on allocation works in practice for ldb because indexes
2049 tend to expand by one record at a time.
2050 This internal fragmentation can be resolved by having an
2051 \begin_inset Quotes eld
2055 \begin_inset Quotes erd
2058 bit in the header to note entries that have previously expanded, and allocating
2059 more space for them.
2062 \begin_layout Standard
2063 There are is a spectrum of possible solutions for external fragmentation:
2064 one is to use a fragmentation-avoiding allocation strategy such as best-fit
2065 address-order allocator.
2066 The other end of the spectrum would be to use a bump allocator (very fast
2067 and simple) and simply repack the file when we reach the end.
2070 \begin_layout Standard
2071 There are three problems with efficient fragmentation-avoiding allocators:
2072 they are non-trivial, they tend to use a single free list for each size,
2073 and there's no evidence that tdb allocation patterns will match those recorded
2074 for general allocators (though it seems likely).
2077 \begin_layout Standard
2078 Thus we don't spend too much effort on external fragmentation; we will be
2079 no worse than the current code if we need to repack on occasion.
2080 More effort is spent on reducing freelist contention, and reducing overhead.
2083 \begin_layout Subsection
2084 \begin_inset CommandInset label
2086 name "sub:Records-Incur-A"
2090 Records Incur A 28-Byte Overhead
2093 \begin_layout Standard
2094 Each TDB record has a header as follows:
2097 \begin_layout LyX-Code
2101 \begin_layout LyX-Code
2102 tdb_off_t next; /* offset of the next record in the list */
2105 \begin_layout LyX-Code
2106 tdb_len_t rec_len; /* total byte length of record */
2109 \begin_layout LyX-Code
2110 tdb_len_t key_len; /* byte length of key */
2113 \begin_layout LyX-Code
2114 tdb_len_t data_len; /* byte length of data */
2117 \begin_layout LyX-Code
2118 uint32_t full_hash; /* the full 32 bit hash of the key */
2121 \begin_layout LyX-Code
2122 uint32_t magic; /* try to catch errors */
2125 \begin_layout LyX-Code
2126 /* the following union is implied:
2129 \begin_layout LyX-Code
2133 \begin_layout LyX-Code
2134 char record[rec_len];
2137 \begin_layout LyX-Code
2141 \begin_layout LyX-Code
2145 \begin_layout LyX-Code
2146 char data[data_len];
2149 \begin_layout LyX-Code
2153 \begin_layout LyX-Code
2154 uint32_t totalsize; (tailer)
2157 \begin_layout LyX-Code
2161 \begin_layout LyX-Code
2165 \begin_layout LyX-Code
2169 \begin_layout Standard
2170 Naively, this would double to a 56-byte overhead on a 64 bit implementation.
2173 \begin_layout Subsubsection
2177 \begin_layout Standard
2178 We can use various techniques to reduce this for an allocated block:
2181 \begin_layout Enumerate
2182 The 'next' pointer is not required, as we are using a flat hash table.
2185 \begin_layout Enumerate
2186 'rec_len' can instead be expressed as an addition to key_len and data_len
2187 (it accounts for wasted or overallocated length in the record).
2188 Since the record length is always a multiple of 8, we can conveniently
2189 fit it in 32 bits (representing up to 35 bits).
2192 \begin_layout Enumerate
2193 'key_len' and 'data_len' can be reduced.
2194 I'm unwilling to restrict 'data_len' to 32 bits, but instead we can combine
2195 the two into one 64-bit field and using a 5 bit value which indicates at
2196 what bit to divide the two.
2197 Keys are unlikely to scale as fast as data, so I'm assuming a maximum key
2201 \begin_layout Enumerate
2202 'full_hash' is used to avoid a memcmp on the
2203 \begin_inset Quotes eld
2207 \begin_inset Quotes erd
2210 case, but this is diminishing returns after a handful of bits (at 10 bits,
2211 it reduces 99.9% of false memcmp).
2212 As an aside, as the lower bits are already incorporated in the hash table
2213 resolution, the upper bits should be used here.
2214 Note that it's not clear that these bits will be a win, given the extra
2215 bits in the hash table itself (see
2216 \begin_inset CommandInset ref
2218 reference "sub:Hash-Size-Solution"
2225 \begin_layout Enumerate
2226 'magic' does not need to be enlarged: it currently reflects one of 5 values
2227 (used, free, dead, recovery, and unused_recovery).
2228 It is useful for quick sanity checking however, and should not be eliminated.
2231 \begin_layout Enumerate
2232 'tailer' is only used to coalesce free blocks (so a block to the right can
2233 find the header to check if this block is free).
2234 This can be replaced by a single 'free' bit in the header of the following
2235 block (and the tailer only exists in free blocks).
2239 \begin_layout Plain Layout
2240 This technique from Thomas Standish.
2241 Data Structure Techniques.
2242 Addison-Wesley, Reading, Massachusetts, 1980.
2247 The current proposed coalescing algorithm doesn't need this, however.
2250 \begin_layout Standard
2251 This produces a 16 byte used header like this:
2254 \begin_layout LyX-Code
2255 struct tdb_used_record {
2258 \begin_layout LyX-Code
2259 uint32_t used_magic : 16,
2262 \begin_layout LyX-Code
2266 \begin_layout LyX-Code
2270 \begin_layout LyX-Code
2274 \begin_layout LyX-Code
2275 uint32_t extra_octets;
2278 \begin_layout LyX-Code
2279 uint64_t key_and_data_len;
2282 \begin_layout LyX-Code
2286 \begin_layout Standard
2287 And a free record like this:
2290 \begin_layout LyX-Code
2291 struct tdb_free_record {
2294 \begin_layout LyX-Code
2295 uint64_t free_magic: 8,
2298 \begin_layout LyX-Code
2302 \begin_layout LyX-Code
2306 \begin_layout LyX-Code
2307 uint64_t free_table: 8,
2310 \begin_layout LyX-Code
2314 \begin_layout LyX-Code
2318 \begin_layout LyX-Code
2322 \begin_layout Standard
2324 \change_deleted 0 1291206079
2327 Note that by limiting valid offsets to 56 bits, we can pack everything we
2328 need into 3 64-byte words, meaning our minimum record size is 8 bytes.
2331 \begin_layout Subsubsection
2335 \begin_layout Standard
2339 \begin_layout Subsection
2340 Transaction Commit Requires 4 fdatasync
2343 \begin_layout Standard
2344 The current transaction algorithm is:
2347 \begin_layout Enumerate
2348 write_recovery_data();
2351 \begin_layout Enumerate
2355 \begin_layout Enumerate
2356 write_recovery_header();
2359 \begin_layout Enumerate
2363 \begin_layout Enumerate
2364 overwrite_with_new_data();
2367 \begin_layout Enumerate
2371 \begin_layout Enumerate
2372 remove_recovery_header();
2375 \begin_layout Enumerate
2379 \begin_layout Standard
2380 On current ext3, each sync flushes all data to disk, so the next 3 syncs
2381 are relatively expensive.
2382 But this could become a performance bottleneck on other filesystems such
2386 \begin_layout Subsubsection
2390 \begin_layout Standard
2391 Neil Brown points out that this is overzealous, and only one sync is needed:
2394 \begin_layout Enumerate
2395 Bundle the recovery data, a transaction counter and a strong checksum of
2399 \begin_layout Enumerate
2400 Strong checksum that whole bundle.
2403 \begin_layout Enumerate
2404 Store the bundle in the database.
2407 \begin_layout Enumerate
2408 Overwrite the oldest of the two recovery pointers in the header (identified
2409 using the transaction counter) with the offset of this bundle.
2412 \begin_layout Enumerate
2416 \begin_layout Enumerate
2417 Write the new data to the file.
2420 \begin_layout Standard
2421 Checking for recovery means identifying the latest bundle with a valid checksum
2422 and using the new data checksum to ensure that it has been applied.
2423 This is more expensive than the current check, but need only be done at
2425 For running databases, a separate header field can be used to indicate
2426 a transaction in progress; we need only check for recovery if this is set.
2429 \begin_layout Subsubsection
2433 \begin_layout Standard
2437 \begin_layout Subsection
2438 \begin_inset CommandInset label
2440 name "sub:TDB-Does-Not"
2444 TDB Does Not Have Snapshot Support
2447 \begin_layout Subsubsection
2448 Proposed SolutionNone.
2449 At some point you say
2450 \begin_inset Quotes eld
2454 \begin_inset Quotes erd
2458 \begin_inset CommandInset ref
2460 reference "replay-attribute"
2467 \begin_layout Standard
2468 But as a thought experiment, if we implemented transactions to only overwrite
2469 free entries (this is tricky: there must not be a header in each entry
2470 which indicates whether it is free, but use of presence in metadata elsewhere),
2471 and a pointer to the hash table, we could create an entirely new commit
2472 without destroying existing data.
2473 Then it would be easy to implement snapshots in a similar way.
2476 \begin_layout Standard
2477 This would not allow arbitrary changes to the database, such as tdb_repack
2478 does, and would require more space (since we have to preserve the current
2479 and future entries at once).
2480 If we used hash trees rather than one big hash table, we might only have
2481 to rewrite some sections of the hash, too.
2484 \begin_layout Standard
2485 We could then implement snapshots using a similar method, using multiple
2486 different hash tables/free tables.
2489 \begin_layout Subsubsection
2493 \begin_layout Standard
2497 \begin_layout Subsection
2498 Transactions Cannot Operate in Parallel
2501 \begin_layout Standard
2502 This would be useless for ldb, as it hits the index records with just about
2504 It would add significant complexity in resolving clashes, and cause the
2505 all transaction callers to write their code to loop in the case where the
2506 transactions spuriously failed.
2509 \begin_layout Subsubsection
2513 \begin_layout Standard
2515 \begin_inset CommandInset ref
2517 reference "replay-attribute"
2522 We could solve a small part of the problem by providing read-only transactions.
2523 These would allow one write transaction to begin, but it could not commit
2524 until all r/o transactions are done.
2525 This would require a new RO_TRANSACTION_LOCK, which would be upgraded on
2529 \begin_layout Subsubsection
2533 \begin_layout Standard
2537 \begin_layout Subsection
2538 Default Hash Function Is Suboptimal
2541 \begin_layout Standard
2542 The Knuth-inspired multiplicative hash used by tdb is fairly slow (especially
2543 if we expand it to 64 bits), and works best when the hash bucket size is
2544 a prime number (which also means a slow modulus).
2545 In addition, it is highly predictable which could potentially lead to a
2546 Denial of Service attack in some TDB uses.
2549 \begin_layout Subsubsection
2553 \begin_layout Standard
2554 The Jenkins lookup3 hash
2558 \begin_layout Plain Layout
2559 http://burtleburtle.net/bob/c/lookup3.c
2564 is a fast and superbly-mixing hash.
2565 It's used by the Linux kernel and almost everything else.
2566 This has the particular properties that it takes an initial seed, and produces
2567 two 32 bit hash numbers, which we can combine into a 64-bit hash.
2570 \begin_layout Standard
2571 The seed should be created at tdb-creation time from some random source,
2572 and placed in the header.
2573 This is far from foolproof, but adds a little bit of protection against
2577 \begin_layout Subsubsection
2581 \begin_layout Standard
2585 \begin_layout Subsection
2586 \begin_inset CommandInset label
2588 name "Reliable-Traversal-Adds"
2592 Reliable Traversal Adds Complexity
2595 \begin_layout Standard
2596 We lock a record during traversal iteration, and try to grab that lock in
2598 If that grab on delete fails, we simply mark it deleted and continue onwards;
2599 traversal checks for this condition and does the delete when it moves off
2603 \begin_layout Standard
2604 If traversal terminates, the dead record may be left indefinitely.
2607 \begin_layout Subsubsection
2611 \begin_layout Standard
2612 Remove reliability guarantees; see
2613 \begin_inset CommandInset ref
2615 reference "traverse-Proposed-Solution"
2622 \begin_layout Subsubsection
2626 \begin_layout Standard
2630 \begin_layout Subsection
2631 Fcntl Locking Adds Overhead
2634 \begin_layout Standard
2635 Placing a fcntl lock means a system call, as does removing one.
2636 This is actually one reason why transactions can be faster (everything
2637 is locked once at transaction start).
2638 In the uncontended case, this overhead can theoretically be eliminated.
2641 \begin_layout Subsubsection
2645 \begin_layout Standard
2649 \begin_layout Standard
2650 We tried this before with spinlock support, in the early days of TDB, and
2651 it didn't make much difference except in manufactured benchmarks.
2654 \begin_layout Standard
2655 We could use spinlocks (with futex kernel support under Linux), but it means
2656 that we lose automatic cleanup when a process dies with a lock.
2657 There is a method of auto-cleanup under Linux, but it's not supported by
2658 other operating systems.
2659 We could reintroduce a clear-if-first-style lock and sweep for dead futexes
2660 on open, but that wouldn't help the normal case of one concurrent opener
2662 Increasingly elaborate repair schemes could be considered, but they require
2663 an ABI change (everyone must use them) anyway, so there's no need to do
2664 this at the same time as everything else.
2667 \begin_layout Subsection
2668 Some Transactions Don't Require Durability
2671 \begin_layout Standard
2672 Volker points out that gencache uses a CLEAR_IF_FIRST tdb for normal (fast)
2673 usage, and occasionally empties the results into a transactional TDB.
2674 This kind of usage prioritizes performance over durability: as long as
2675 we are consistent, data can be lost.
2678 \begin_layout Standard
2679 This would be more neatly implemented inside tdb: a
2680 \begin_inset Quotes eld
2684 \begin_inset Quotes erd
2687 transaction commit (ie.
2688 syncless) which meant that data may be reverted on a crash.
2691 \begin_layout Subsubsection
2695 \begin_layout Standard
2699 \begin_layout Standard
2700 Unfortunately any transaction scheme which overwrites old data requires
2701 a sync before that overwrite to avoid the possibility of corruption.
2704 \begin_layout Standard
2705 It seems possible to use a scheme similar to that described in
2706 \begin_inset CommandInset ref
2708 reference "sub:TDB-Does-Not"
2712 ,where transactions are committed without overwriting existing data, and
2713 an array of top-level pointers were available in the header.
2714 If the transaction is
2715 \begin_inset Quotes eld
2719 \begin_inset Quotes erd
2722 then we would not need a sync at all: existing processes would pick up
2723 the new hash table and free list and work with that.
2726 \begin_layout Standard
2727 At some later point, a sync would allow recovery of the old data into the
2728 free lists (perhaps when the array of top-level pointers filled).
2729 On crash, tdb_open() would examine the array of top levels, and apply the
2730 transactions until it encountered an invalid checksum.
2733 \begin_layout Subsection
2734 Tracing Is Fragile, Replay Is External
2737 \begin_layout Standard
2738 The current TDB has compile-time-enabled tracing code, but it often breaks
2739 as it is not enabled by default.
2740 In a similar way, the ctdb code has an external wrapper which does replay
2741 tracing so it can coordinate cluster-wide transactions.
2744 \begin_layout Subsubsection
2746 \begin_inset CommandInset label
2748 name "replay-attribute"
2755 \begin_layout Standard
2756 Tridge points out that an attribute can be later added to tdb_open (see
2758 \begin_inset CommandInset ref
2760 reference "attributes"
2764 ) to provide replay/trace hooks, which could become the basis for this and
2765 future parallel transactions and snapshot support.
2768 \begin_layout Subsubsection
2772 \begin_layout Standard
2783 @Add status, some fixes, linked freelists.
2789 \change_deleted 0 1291204535
2791 \change_inserted 0 1291204533
2796 \change_inserted 0 1291204563
2800 \change_inserted 0 1291204572
2803 \change_inserted 0 1291204573
2808 \change_inserted 0 1291204588
2812 \change_inserted 0 1291204588
2815 \change_inserted 0 1291204631
2820 \change_inserted 0 1291204639
2824 \change_inserted 0 1291204640
2827 \change_inserted 0 1291204665
2832 \change_inserted 0 1291204671
2836 \change_inserted 0 1291204671
2839 \change_inserted 0 1291204673
2844 \change_inserted 0 1291204731
2848 \change_inserted 0 1291204732
2851 \change_inserted 0 1291204779
2856 \change_inserted 0 1291204830
2860 \change_inserted 0 1291204831
2863 \change_inserted 0 1291204834
2868 \change_inserted 0 1291204847
2872 \change_inserted 0 1291204847
2875 \change_inserted 0 1291204852
2880 \change_inserted 0 1291204881
2884 \change_inserted 0 1291204881
2887 \change_inserted 0 1291204885
2892 \change_inserted 0 1291204898
2896 \change_inserted 0 1291204898
2899 \change_inserted 0 1291204901
2904 \change_inserted 0 1291204908
2908 \change_inserted 0 1291204908
2911 \change_inserted 0 1291204908
2916 \change_inserted 0 1291204917
2920 \change_inserted 0 1291204917
2923 \change_inserted 0 1291204920
2928 \change_inserted 0 1291204927
2932 \change_inserted 0 1291204928
2935 \change_inserted 0 1291204942
2940 \change_inserted 0 1291205003
2944 \change_inserted 0 1291205004
2947 \change_inserted 0 1291205007
2949 \change_inserted 0 1291205019
2953 \change_inserted 0 1291205019
2956 \change_inserted 0 1291205023
2961 \change_inserted 0 1291205029
2965 \change_inserted 0 1291205029
2968 \change_inserted 0 1291206020
2973 \change_inserted 0 1291205043
2977 \change_inserted 0 1291205043
2980 \change_inserted 0 1291205057
2985 \change_inserted 0 1291205062
2989 \change_inserted 0 1291205062
2992 \change_inserted 0 1291205062
2997 \change_inserted 0 1291205072
3001 \change_inserted 0 1291205073
3004 \change_inserted 0 1291205073
3010 \change_deleted 0 1291204504
3014 \change_inserted 0 1291205079
3018 \change_inserted 0 1291205080
3021 \change_inserted 0 1291205080
3026 \change_inserted 0 1291205090
3030 is to divide the file into zones, and using a free list (or
3031 \change_inserted 0 1291205498
3033 \change_deleted 0 1291205497
3036 of free lists) for each.
3038 \change_inserted 0 1291205203
3042 \change_inserted 0 1291205358
3048 \begin_layout Standard
3050 \change_deleted 0 1291205198
3051 Note that this means we need to split the free lists when we expand the
3052 file; this is probably acceptable when we double the hash table size, since
3053 that is such an expensive operation already.
3054 In the case of increasing the file size, there is an optimization we can
3055 use: if we use M in the formula above as the file size rounded up to the
3056 next power of 2, we only need reshuffle free lists when the file size crosses
3057 a power of 2 boundary,
3061 reshuffling the free lists is trivial: we simply merge every consecutive
3067 Identify the correct
3068 \change_inserted 0 1291205366
3070 \change_deleted 0 1291205364
3077 \change_inserted 0 1291205372
3079 \change_deleted 0 1291205371
3082 (we didn't have a lock, sizes could have changed): relock if necessary.
3085 Place the freed entry in the list
3086 \change_deleted 0 1291205382
3093 \change_deleted 0 1291205403
3094 zone either the zone we last freed into, or based on a
3095 \begin_inset Quotes eld
3099 \begin_inset Quotes erd
3103 \change_inserted 0 1291205411
3104 free table; usually the previous one.
3108 \change_deleted 0 1291205432
3112 \begin_layout Enumerate
3114 \change_deleted 0 1291205428
3115 Re-check the zone: relock if necessary.
3120 unlock the list and try the next
3121 \change_inserted 0 1291205455
3123 \change_deleted 0 1291205452
3125 \change_inserted 0 1291205457
3129 \change_inserted 0 1291205476
3134 \change_inserted 0 1291205542
3138 \change_inserted 0 1291205591
3144 \begin_layout Standard
3146 \change_deleted 0 1291205539
3147 I anticipate that the number of entries in each free zone would be small,
3148 but it might be worth using one free entry to hold pointers to the others
3149 for cache efficiency.
3154 \begin_layout Standard
3156 \change_deleted 0 1291205534
3157 \begin_inset CommandInset label
3159 name "freelist-in-zone"
3163 If we want to avoid locking complexity (enlarging the free lists when we
3164 enlarge the file) we could place the array of free lists at the beginning
3166 This means existing array lists never move, but means that a record cannot
3167 be larger than a zone.
3168 That in turn implies that zones should be variable sized (say, power of
3169 2), which makes the question
3170 \begin_inset Quotes eld
3173 what zone is this record in?
3174 \begin_inset Quotes erd
3178 \begin_inset Quotes eld
3182 \begin_inset Quotes erd
3185 , but that's less common).
3186 It could be done with as few as 4 bits from the record header.
3190 \begin_layout Plain Layout
3192 \begin_inset Formula $2^{16+N*3}$
3195 means 0 gives a minimal 65536-byte zone, 15 gives the maximal
3196 \begin_inset Formula $2^{61}$
3200 Zones range in factor of 8 steps.
3201 Given the zone size for the zone the current record is in, we can determine
3202 the start of the zone.
3208 \change_inserted 0 1291205139
3213 \change_inserted 0 1291205758
3218 \change_deleted 0 1291205693
3225 \change_inserted 0 1291205704
3227 \change_deleted 0 1291205704
3234 \change_inserted 0 1291205725
3236 \change_deleted 0 1291205723
3240 \change_inserted 0 1291205753
3244 \change_inserted 0 1291205746
3246 \change_deleted 0 1291205749
3252 \begin_layout LyX-Code
3254 \change_inserted 0 1291205786
3258 \begin_layout LyX-Code
3260 \change_inserted 0 1291205788
3264 \change_inserted 0 1291205792
3266 \change_deleted 0 1291205790
3273 \change_deleted 0 1291205801
3277 \change_deleted 0 1291205811
3282 \change_deleted 0 1291205811
3287 \change_deleted 0 1291205808
3293 \change_deleted 0 1291205827
3294 We might want to take some bits from the used record's top_hash (and the
3295 free record which has 32 bits of padding to spare anyway) if we use variable
3298 \begin_inset CommandInset ref
3300 reference "freelist-in-zone"
3306 \change_inserted 0 1291205885
3307 Note that by limiting valid offsets to 56 bits, we can pack everything
3308 we need into 3 64-byte words, meaning our minimum record size is 8 bytes.
3311 \change_inserted 0 1291205886
3314 \change_inserted 0 1291205886
3319 \change_inserted 0 1291205894
3323 \change_inserted 0 1291205894
3326 \change_inserted 0 1291205902
3332 \change_deleted 0 1291204504
3336 \change_inserted 0 1291205910
3340 \change_inserted 0 1291205910
3343 \change_inserted 0 1291205914
3348 \change_inserted 0 1291205919
3352 \change_inserted 0 1291205919
3355 \change_inserted 0 1291205922
3360 \change_inserted 0 1291205929
3364 \change_inserted 0 1291205929
3367 \change_inserted 0 1291205929
3372 \change_inserted 0 1291205932
3376 \change_inserted 0 1291205933
3379 \change_inserted 0 1291205933
3384 \change_inserted 0 1291205944
3388 \change_inserted 0 1291205945
3391 \change_inserted 0 1291205948
3420 if older code (which doesn't understand the feature) writes to the database.Reco
3421 rd Headers Are Not Expandible
3430 Proposed SolutionThe first step is to remove all the current heuristics,
3431 as they obviously interact, then examine them once the lock contention
3435 is to divide the file into zones, and using a free list (or set of free
3442 Identify the correct zone.
3445 Re-check the zone (we didn't have a lock, sizes could have changed): relock
3449 Place the freed entry in the list for that zone.
3452 Pick a zone either the zone we last freed into, or based on a
3459 unlock the list and try the next zone.
3470 uint32_t magic : 16,
3478 uint32_t free_magic;
3481 uint64_t total_length;
3484 uint64_t prev, next;
3502 @Tracing attribute, talloc support.
3507 #LyX 1.6.5 created this file. For more info see http://www.lyx.org/
3511 \change_deleted 0 1283307542
3513 \change_inserted 0 1284423485
3518 \change_inserted 0 1284422789
3525 \change_inserted 0 1284016998
3530 \change_inserted 0 1284015637
3534 \change_inserted 0 1284015716
3537 \change_inserted 0 1284015906
3540 \change_inserted 0 1284015637
3543 \change_inserted 0 1284016114
3546 \change_inserted 0 1284016149
3549 \change_inserted 0 1284016639
3552 \change_inserted 0 1284016821
3555 \change_inserted 0 1284016803
3558 if older code (which doesn't understand the feature) writes to the database.
3559 \change_deleted 0 1284016101
3563 \begin_layout Subsection
3565 \change_inserted 0 1284015634
3566 Record Headers Are Not Expandible
3569 \change_inserted 0 1284015634
3572 \change_inserted 0 1284015634
3575 \change_inserted 0 1284422552
3578 \change_inserted 0 1284422568
3581 \change_inserted 0 1284422646
3584 \change_inserted 0 1284422656
3587 \change_inserted 0 1284423065
3590 \change_inserted 0 1284423042
3596 \change_inserted 0 1283336713
3603 \change_deleted 0 1283307675
3604 There are three details which become important:
3607 \begin_layout Enumerate
3609 \change_deleted 0 1283307675
3610 On encountering a full bucket, we use the next bucket.
3613 \begin_layout Enumerate
3615 \change_deleted 0 1283307675
3616 Extra hash bits are stored with the offset, to reduce comparisons.
3619 \begin_layout Enumerate
3621 \change_deleted 0 1283307675
3622 A marker entry is used on deleting an entry.
3625 \begin_layout Standard
3627 \change_deleted 0 1283307675
3628 The doubling of the table must be done under a transaction; we will not
3629 reduce it on deletion, so it will be an unusual case.
3630 It will either be placed at the head (other entries will be moved out the
3631 way so we can expand).
3632 We could have a pointer in the header to the current hashtable location,
3633 but that pointer would have to be read frequently to check for hashtable
3637 \begin_layout Standard
3639 \change_deleted 0 1283307675
3640 The locking for this is slightly more complex than the chained case; we
3641 currently have one lock per bucket, and that means we would need to expand
3642 the lock if we overflow to the next bucket.
3643 The frequency of such collisions will effect our locking heuristics: we
3644 can always lock more buckets than we need.
3647 \begin_layout Standard
3649 \change_deleted 0 1283307675
3650 One possible optimization is to only re-check the hash size on an insert
3653 \change_inserted 0 1283307770
3656 \change_inserted 0 1283336187
3659 \change_inserted 0 1283336586
3666 \change_deleted 0 1283336858
3670 \begin_layout Standard
3671 The first step is to remove all the current heuristics, as they obviously
3672 interact, then examine them once the lock contention is addressed.
3674 \change_inserted 0 1283336910
3678 \change_inserted 0 1283337052
3683 \change_inserted 0 1283309850
3687 \change_inserted 0 1283337216
3690 \change_inserted 0 1284424151
3699 \change_inserted 0 1283336739
3704 \change_inserted 0 1283337133
3708 \change_inserted 0 1283337139
3714 \change_inserted 0 1283337235
3721 \change_deleted 0 1284423472
3725 \begin_layout Standard
3729 \change_inserted 0 1284423891
3732 \change_deleted 0 1284423891
3735 \change_inserted 0 1284423901
3740 \change_inserted 0 1284423495
3744 \change_inserted 0 1284424201
3749 We could solve a small part of the problem by providing read-only transactions.
3751 \change_inserted 0 1284423555
3755 \change_inserted 0 1284423617
3758 \change_inserted 0 1284423719
3761 \change_inserted 0 1284423864
3764 \change_inserted 0 1284423850
3773 @Extension mechanism.
3778 \change_inserted 0 1284016854
3783 \change_inserted 0 1284016847
3787 \change_inserted 0 1283310945
3800 @Remove bogus footnote
3805 \change_inserted 0 1283307544
3814 @Moving hash table does not work.
3821 \begin_layout Plain Layout
3823 \change_inserted 0 1283336450
3824 If we make the hash offsets zone-relative, then this only restricts the
3825 zone size, not the overall database size.
3847 There are three details which become important:
3862 \begin_layout LyX-Code
3868 @Soft transaction commit
3873 \author "Rusty Russell,,,"
3876 \change_deleted 0 1280141199
3878 \change_inserted 0 1280141202
3884 \change_inserted 0 1280140902
3889 \change_inserted 0 1280140661
3893 \change_inserted 0 1280140703
3896 \change_inserted 0 1280708312
3899 \change_inserted 0 1280708400
3902 \change_inserted 0 1280140836
3905 \change_inserted 0 1280708255
3908 \change_inserted 0 1280708374
3911 \change_inserted 0 1280141181
3914 \change_inserted 0 1280141345
3935 @Transaction and freelist rethink.
3940 \author "Rusty Russell,,,"
3946 behavior of disallowing
3947 \change_inserted 0 1272940179
3950 transactions should become the default.
3952 \change_inserted 0 1272944650
3956 \change_inserted 0 1272944763
3965 \change_inserted 0 1273478114
3972 \change_deleted 0 1273469807
3974 \change_inserted 0 1273469810
3978 \change_deleted 0 1273469815
3981 to reduce contention.
3983 \change_inserted 0 1273470006
3987 \change_inserted 0 1273492055
3990 \change_inserted 0 1273483888
3996 \change_deleted 0 1272942055
3997 There are various ways to organize these lisys, but because we want to be
3998 able to quickly identify which free list an entry is in, and reduce the
3999 number of locks required for merging, we will use zoning (eg.
4000 each free list covers some fixed fraction of the file).
4002 \change_inserted 0 1273484187
4006 \change_deleted 0 1273484194
4008 \change_inserted 0 1273484194
4014 Identify the correct
4015 \change_deleted 0 1273482856
4017 \change_inserted 0 1273482857
4024 \change_inserted 0 1273482895
4028 \change_inserted 0 1273482863
4032 \change_inserted 0 1273482909
4036 \change_deleted 0 1273482885
4038 \change_inserted 0 1273482888
4041 lace the freed entry
4042 \change_deleted 0 1273492415
4044 \change_inserted 0 1273492415
4045 in the list for that zone
4050 Allocation is a little more complicated, as we
4051 \change_deleted 0 1273483240
4052 merge entries as we walk the list:
4053 \change_inserted 0 1273484250
4054 perform delayed coalescing at this point:
4060 \change_deleted 0 1273482955
4062 \change_inserted 0 1273482957
4066 \change_deleted 0 1273482962
4068 \change_inserted 0 1273482962
4072 \change_deleted 0 1273482966
4074 \change_inserted 0 1273482966
4081 \change_inserted 0 1273482980
4083 \change_deleted 0 1273482973
4087 \change_inserted 0 1273482982
4091 \change_inserted 0 1273483084
4097 \begin_layout Enumerate
4099 \change_deleted 0 1273492155
4101 \change_inserted 0 1273492159
4104 remove it from the list and return it.
4107 \begin_layout Enumerate
4109 \change_inserted 0 1273492206
4110 coalesce entries in the list.
4111 \change_deleted 0 1273492200
4112 examine the entry to the right of it in the file.
4117 \begin_layout Enumerate
4119 \change_deleted 0 1273492200
4120 If that entry is in a different list, lock that list too.
4123 \begin_layout Enumerate
4125 \change_deleted 0 1273492200
4126 If we had to place a new lock, re-check that the entry is free.
4129 \begin_layout Enumerate
4131 \change_deleted 0 1273492200
4132 Remove that entry from its free list and expand this entry to cover it.
4135 \begin_layout Enumerate
4137 \change_deleted 0 1273485554
4142 \begin_layout Enumerate
4144 \change_inserted 0 1273485311
4145 If there was no entry large enough, unlock the list and try the next zone.
4149 \change_deleted 0 1273483646
4150 Repeat step 3 with each entry in the list.
4156 \change_deleted 0 1273483668
4157 Unlock the list and repeat step 2 with the next list.
4163 \change_deleted 0 1273483671
4165 \change_inserted 0 1273483671
4168 satisfies, expand the file.
4171 This optimizes rapid insert/delete of free list entries
4172 \change_inserted 0 1273485794
4173 by not coalescing them all the time.
4174 \change_deleted 0 1273483685
4175 , and allows us to get rid of the tailer altogether
4179 \change_inserted 0 1273492299
4182 \change_deleted 0 1273476840
4184 \begin_inset Quotes eld
4188 \begin_inset Quotes erd
4191 free entries is more difficult: the 25% overhead works in practice for
4192 ldb because indexes tend to expand by one record at a time.
4193 This can be resolved by having an
4194 \begin_inset Quotes eld
4198 \begin_inset Quotes erd
4201 bit in the header to note entries that have previously expanded, and allocating
4202 more space for them.
4204 \begin_inset Quotes eld
4208 \begin_inset Quotes erd
4211 algorithm should be implemented or first-fit used is still unknown: we
4212 will determine this once these other ideas are implemented.
4213 \change_inserted 0 1273483750
4217 \begin_layout Standard
4219 \change_inserted 0 1273492450
4222 \change_inserted 0 1273470441
4225 \change_inserted 0 1273476556
4228 \change_inserted 0 1273470423
4234 \change_inserted 0 1273476847
4237 \change_inserted 0 1273476886
4240 \change_inserted 0 1273477233
4243 \change_inserted 0 1273477534
4246 \change_inserted 0 1273482700
4249 \change_inserted 0 1273478079
4252 \change_inserted 0 1273477839
4255 \change_inserted 0 1273477925
4258 \change_inserted 0 1273477925
4261 \change_inserted 0 1273477925
4264 \change_inserted 0 1273477925
4267 \change_inserted 0 1273477925
4270 \change_inserted 0 1273477925
4273 \change_inserted 0 1273477925
4276 \change_inserted 0 1273477925
4279 \change_inserted 0 1273477925
4282 \change_inserted 0 1273477925
4285 \change_inserted 0 1273477925
4288 \change_inserted 0 1273477925
4291 \change_inserted 0 1273477925
4294 \change_inserted 0 1273477925
4297 \change_inserted 0 1273477925
4300 \change_inserted 0 1273477925
4303 \change_inserted 0 1273477925
4306 \change_inserted 0 1273477925
4309 \change_inserted 0 1273492522
4312 \change_inserted 0 1273492530
4315 \change_inserted 0 1273492546
4318 \change_inserted 0 1273478239
4321 \change_inserted 0 1273479960
4324 \change_inserted 0 1273480265
4327 \change_inserted 0 1273480354
4330 \change_inserted 0 1273478968
4333 \change_inserted 0 1273492604
4336 \change_inserted 0 1273479572
4342 \change_inserted 0 1273480282
4345 \change_inserted 0 1273478931
4348 \change_inserted 0 1273481549
4351 \change_inserted 0 1273481557
4354 \change_inserted 0 1273480307
4357 \change_inserted 0 1273480335
4360 \change_inserted 0 1273479897
4363 \change_inserted 0 1273479653
4366 \change_inserted 0 1273480371
4369 \change_inserted 0 1273480464
4372 \change_inserted 0 1273480399
4375 \change_inserted 0 1273480425
4378 \change_inserted 0 1273480453
4381 \change_inserted 0 1273480455
4384 \change_inserted 0 1273480450
4387 \change_inserted 0 1273480452
4389 \change_inserted 0 1273478830
4393 \change_deleted 0 1273481604
4394 In theory, we could get away with 2: one after we write the new data, and
4395 one to somehow atomically change over to it.
4396 \change_inserted 0 1273481632
4399 \change_inserted 0 1273481724
4402 \change_inserted 0 1273481713
4405 \change_inserted 0 1273481717
4408 \change_inserted 0 1273481730
4411 \change_inserted 0 1273481736
4414 \change_inserted 0 1273481744
4417 \change_inserted 0 1273481748
4420 \change_inserted 0 1273482185
4423 \change_inserted 0 1273482259
4426 \change_deleted 0 1273481848
4428 Trying to rewrite the transaction code is a separate experiment, which
4429 I encourage someone else to do.
4430 At some point you say
4431 \begin_inset Quotes eld
4435 \begin_inset Quotes erd
4441 \begin_layout Standard
4443 \change_deleted 0 1273481848
4444 But as a thought experiment:
4449 \begin_layout Standard
4451 \change_deleted 0 1273481788
4452 Say there was a pointer in the header which said where the hash table and
4453 free list tables were, and that no blocks were labeled with whether they
4454 were free or not (it had to be derived from what list they were in).
4455 We could create new hash table and free list in some free space, and populate
4456 it as we want the post-committed state to look.
4457 Then we sync, then we switch the offset in the header, then we sync again.
4460 \begin_layout Standard
4462 \change_deleted 0 1273481788
4463 This would not allow arbitrary changes to the database, such as tdb_repack
4464 does, and would require more space (since we have to preserve the current
4465 and future entries at once).
4466 If we used hash trees rather than one big hash table, we might only have
4467 to rewrite some sections of the hash, too.
4468 \change_inserted 0 1273481854
4472 \begin_layout Standard
4474 \change_inserted 0 1273482102
4477 \change_inserted 0 1273482061
4480 \change_inserted 0 1273482063
4483 \change_inserted 0 1273482072
4486 \change_inserted 0 1273482139
4489 \change_inserted 0 1273482364
4492 \change_inserted 0 1273482163
4495 \change_inserted 0 1273482493
4498 \change_inserted 0 1273482536
4504 \change_inserted 0 1273482641
4507 \change_inserted 0 1273481827
4511 \change_inserted 0 1273481829
4514 implement snapshots using a similar method
4515 \change_deleted 0 1273481838
4517 \change_inserted 0 1273481840
4520 using multiple different hash tables/free tables.
4526 @After first feedback (Ronnie & Volker)
4532 The free list should be split into multiple lists to reduce contention.
4537 The algorithm for freeing is simple:
4540 Identify the correct free list.
4543 Lock the list, and place the freed entry at the head.
4546 Allocation is a little more complicated, as we merge entries as we walk
4550 Pick a free list; either the list we last freed onto, or based on a
4556 If the top entry is well-sized, remove it from the list and return it.
4559 Otherwise, examine the entry to the right of it in the file.
4570 If no list satisfies, expand the file.
4573 This optimizes rapid insert/delete of free list entries, and allows us to
4574 get rid of the tailer altogether.
4578 \change_inserted 0 1272941474
4581 \change_inserted 0 1272942759
4582 There are various ways to organize these lists, but because we want to be
4583 able to quickly identify which free list an entry is in, and reduce the
4584 number of locks required for merging, we will use zoning (eg.
4585 each of the N free lists in a tdb file of size M covers a fixed fraction
4587 Note that this means we need to reshuffle the free lists when we expand
4588 the file; this is probably acceptable when we double the hash table size,
4589 since that is such an expensive operation already.
4590 In the case of increasing the file size, there is an optimization we can
4591 use: if we use M in the formula above as the file size rounded up to the
4592 next power of 2, we only need reshuffle free lists when the file size crosses
4593 a power of 2 boundary,
4597 reshuffling the free lists is trivial: we simply merge every consecutive
4611 We could implement snapshots using a similar method to the above, only using
4612 multiple different hash tables/free tables.
4623 #LyX 1.6.4 created this file. For more info see http://www.lyx.org/
4626 \tracking_changes false
4627 \output_changes false
4631 behavior of disallowing transactions should become the default.
4636 The algorithm for freeing is simple: