git.ozlabs.org Git - ccan/blob - ccan/tdb2/doc/design.lyx

   1 #LyX 1.6.5 created this file. For more info see http://www.lyx.org/
   2 \lyxformat 345
   3 \begin_document
   4 \begin_header
   5 \textclass article
   6 \use_default_options true
   7 \language english
   8 \inputencoding auto
   9 \font_roman default
  10 \font_sans default
  11 \font_typewriter default
  12 \font_default_family default
  13 \font_sc false
  14 \font_osf false
  15 \font_sf_scale 100
  16 \font_tt_scale 100
  17
  18 \graphics default
  19 \paperfontsize default
  20 \use_hyperref false
  21 \papersize default
  22 \use_geometry false
  23 \use_amsmath 1
  24 \use_esint 1
  25 \cite_engine basic
  26 \use_bibtopic false
  27 \paperorientation portrait
  28 \secnumdepth 3
  29 \tocdepth 3
  30 \paragraph_separation indent
  31 \defskip medskip
  32 \quotes_language english
  33 \papercolumns 1
  34 \papersides 1
  35 \paperpagestyle default
  36 \tracking_changes true
  37 \output_changes true
  38 \author "Rusty Russell,,,"
  39 \author ""
  40 \end_header
  41
  42 \begin_body
  43
  44 \begin_layout Title
  45 TDB2: A Redesigning The Trivial DataBase
  46 \end_layout
  47
  48 \begin_layout Author
  49 Rusty Russell, IBM Corporation
  50 \end_layout
  51
  52 \begin_layout Date
  53
  54 \change_deleted 0 1283307542
  55 26-July
  56 \change_inserted 0 1284423485
  57 14-September
  58 \change_unchanged
  59 -2010
  60 \end_layout
  61
  62 \begin_layout Abstract
  63 The Trivial DataBase on-disk format is 32 bits; with usage cases heading
  64  towards the 4G limit, that must change.
  65  This required breakage provides an opportunity to revisit TDB's other design
  66  decisions and reassess them.
  67 \end_layout
  68
  69 \begin_layout Section
  70 Introduction
  71 \end_layout
  72
  73 \begin_layout Standard
  74 The Trivial DataBase was originally written by Andrew Tridgell as a simple
  75  key/data pair storage system with the same API as dbm, but allowing multiple
  76  readers and writers while being small enough (< 1000 lines of C) to include
  77  in SAMBA.
  78  The simple design created in 1999 has proven surprisingly robust and performant
  79 , used in Samba versions 3 and 4 as well as numerous other projects.
  80  Its useful life was greatly increased by the (backwards-compatible!) addition
  81  of transaction support in 2005.
  82 \end_layout
  83
  84 \begin_layout Standard
  85 The wider variety and greater demands of TDB-using code has lead to some
  86  organic growth of the API, as well as some compromises on the implementation.
  87  None of these, by themselves, are seen as show-stoppers, but the cumulative
  88  effect is to a loss of elegance over the initial, simple TDB implementation.
  89  Here is a table of the approximate number of lines of implementation code
  90  and number of API functions at the end of each year:
  91 \end_layout
  92
  93 \begin_layout Standard
  94 \begin_inset Tabular
  95 <lyxtabular version="3" rows="12" columns="3">
  96 <features>
  97 <column alignment="center" valignment="top" width="0">
  98 <column alignment="center" valignment="top" width="0">
  99 <column alignment="center" valignment="top" width="0">
 100 <row>
 101 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
 102 \begin_inset Text
 103
 104 \begin_layout Plain Layout
 105 Year End
 106 \end_layout
 107
 108 \end_inset
 109 </cell>
 110 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
 111 \begin_inset Text
 112
 113 \begin_layout Plain Layout
 114 API Functions
 115 \end_layout
 116
 117 \end_inset
 118 </cell>
 119 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
 120 \begin_inset Text
 121
 122 \begin_layout Plain Layout
 123 Lines of C Code Implementation
 124 \end_layout
 125
 126 \end_inset
 127 </cell>
 128 </row>
 129 <row>
 130 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
 131 \begin_inset Text
 132
 133 \begin_layout Plain Layout
 134 1999
 135 \end_layout
 136
 137 \end_inset
 138 </cell>
 139 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
 140 \begin_inset Text
 141
 142 \begin_layout Plain Layout
 143 13
 144 \end_layout
 145
 146 \end_inset
 147 </cell>
 148 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
 149 \begin_inset Text
 150
 151 \begin_layout Plain Layout
 152 1195
 153 \end_layout
 154
 155 \end_inset
 156 </cell>
 157 </row>
 158 <row>
 159 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
 160 \begin_inset Text
 161
 162 \begin_layout Plain Layout
 163 2000
 164 \end_layout
 165
 166 \end_inset
 167 </cell>
 168 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
 169 \begin_inset Text
 170
 171 \begin_layout Plain Layout
 172 24
 173 \end_layout
 174
 175 \end_inset
 176 </cell>
 177 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
 178 \begin_inset Text
 179
 180 \begin_layout Plain Layout
 181 1725
 182 \end_layout
 183
 184 \end_inset
 185 </cell>
 186 </row>
 187 <row>
 188 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
 189 \begin_inset Text
 190
 191 \begin_layout Plain Layout
 192 2001
 193 \end_layout
 194
 195 \end_inset
 196 </cell>
 197 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
 198 \begin_inset Text
 199
 200 \begin_layout Plain Layout
 201 32
 202 \end_layout
 203
 204 \end_inset
 205 </cell>
 206 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
 207 \begin_inset Text
 208
 209 \begin_layout Plain Layout
 210 2228
 211 \end_layout
 212
 213 \end_inset
 214 </cell>
 215 </row>
 216 <row>
 217 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
 218 \begin_inset Text
 219
 220 \begin_layout Plain Layout
 221 2002
 222 \end_layout
 223
 224 \end_inset
 225 </cell>
 226 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
 227 \begin_inset Text
 228
 229 \begin_layout Plain Layout
 230 35
 231 \end_layout
 232
 233 \end_inset
 234 </cell>
 235 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
 236 \begin_inset Text
 237
 238 \begin_layout Plain Layout
 239 2481
 240 \end_layout
 241
 242 \end_inset
 243 </cell>
 244 </row>
 245 <row>
 246 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
 247 \begin_inset Text
 248
 249 \begin_layout Plain Layout
 250 2003
 251 \end_layout
 252
 253 \end_inset
 254 </cell>
 255 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
 256 \begin_inset Text
 257
 258 \begin_layout Plain Layout
 259 35
 260 \end_layout
 261
 262 \end_inset
 263 </cell>
 264 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
 265 \begin_inset Text
 266
 267 \begin_layout Plain Layout
 268 2552
 269 \end_layout
 270
 271 \end_inset
 272 </cell>
 273 </row>
 274 <row>
 275 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
 276 \begin_inset Text
 277
 278 \begin_layout Plain Layout
 279 2004
 280 \end_layout
 281
 282 \end_inset
 283 </cell>
 284 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
 285 \begin_inset Text
 286
 287 \begin_layout Plain Layout
 288 40
 289 \end_layout
 290
 291 \end_inset
 292 </cell>
 293 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
 294 \begin_inset Text
 295
 296 \begin_layout Plain Layout
 297 2584
 298 \end_layout
 299
 300 \end_inset
 301 </cell>
 302 </row>
 303 <row>
 304 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
 305 \begin_inset Text
 306
 307 \begin_layout Plain Layout
 308 2005
 309 \end_layout
 310
 311 \end_inset
 312 </cell>
 313 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
 314 \begin_inset Text
 315
 316 \begin_layout Plain Layout
 317 38
 318 \end_layout
 319
 320 \end_inset
 321 </cell>
 322 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
 323 \begin_inset Text
 324
 325 \begin_layout Plain Layout
 326 2647
 327 \end_layout
 328
 329 \end_inset
 330 </cell>
 331 </row>
 332 <row>
 333 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
 334 \begin_inset Text
 335
 336 \begin_layout Plain Layout
 337 2006
 338 \end_layout
 339
 340 \end_inset
 341 </cell>
 342 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
 343 \begin_inset Text
 344
 345 \begin_layout Plain Layout
 346 52
 347 \end_layout
 348
 349 \end_inset
 350 </cell>
 351 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
 352 \begin_inset Text
 353
 354 \begin_layout Plain Layout
 355 3754
 356 \end_layout
 357
 358 \end_inset
 359 </cell>
 360 </row>
 361 <row>
 362 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
 363 \begin_inset Text
 364
 365 \begin_layout Plain Layout
 366 2007
 367 \end_layout
 368
 369 \end_inset
 370 </cell>
 371 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
 372 \begin_inset Text
 373
 374 \begin_layout Plain Layout
 375 66
 376 \end_layout
 377
 378 \end_inset
 379 </cell>
 380 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
 381 \begin_inset Text
 382
 383 \begin_layout Plain Layout
 384 4398
 385 \end_layout
 386
 387 \end_inset
 388 </cell>
 389 </row>
 390 <row>
 391 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
 392 \begin_inset Text
 393
 394 \begin_layout Plain Layout
 395 2008
 396 \end_layout
 397
 398 \end_inset
 399 </cell>
 400 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
 401 \begin_inset Text
 402
 403 \begin_layout Plain Layout
 404 71
 405 \end_layout
 406
 407 \end_inset
 408 </cell>
 409 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
 410 \begin_inset Text
 411
 412 \begin_layout Plain Layout
 413 4768
 414 \end_layout
 415
 416 \end_inset
 417 </cell>
 418 </row>
 419 <row>
 420 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
 421 \begin_inset Text
 422
 423 \begin_layout Plain Layout
 424 2009
 425 \end_layout
 426
 427 \end_inset
 428 </cell>
 429 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
 430 \begin_inset Text
 431
 432 \begin_layout Plain Layout
 433 73
 434 \end_layout
 435
 436 \end_inset
 437 </cell>
 438 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
 439 \begin_inset Text
 440
 441 \begin_layout Plain Layout
 442 5715
 443 \end_layout
 444
 445 \end_inset
 446 </cell>
 447 </row>
 448 </lyxtabular>
 449
 450 \end_inset
 451
 452
 453 \end_layout
 454
 455 \begin_layout Standard
 456 This review is an attempt to catalog and address all the known issues with
 457  TDB and create solutions which address the problems without significantly
 458  increasing complexity; all involved are far too aware of the dangers of
 459  second system syndrome in rewriting a successful project like this.
 460 \end_layout
 461
 462 \begin_layout Section
 463 API Issues
 464 \end_layout
 465
 466 \begin_layout Subsection
 467 tdb_open_ex Is Not Expandable
 468 \end_layout
 469
 470 \begin_layout Standard
 471 The tdb_open() call was expanded to tdb_open_ex(), which added an optional
 472  hashing function and an optional logging function argument.
 473  Additional arguments to open would require the introduction of a tdb_open_ex2
 474  call etc.
 475 \end_layout
 476
 477 \begin_layout Subsubsection
 478 Proposed Solution
 479 \change_inserted 0 1284422789
 480
 481 \begin_inset CommandInset label
 482 LatexCommand label
 483 name "attributes"
 484
 485 \end_inset
 486
 487
 488 \change_unchanged
 489
 490 \end_layout
 491
 492 \begin_layout Standard
 493 tdb_open() will take a linked-list of attributes:
 494 \end_layout
 495
 496 \begin_layout LyX-Code
 497 enum tdb_attribute {
 498 \end_layout
 499
 500 \begin_layout LyX-Code
 501     TDB_ATTRIBUTE_LOG = 0,
 502 \end_layout
 503
 504 \begin_layout LyX-Code
 505     TDB_ATTRIBUTE_HASH = 1
 506 \end_layout
 507
 508 \begin_layout LyX-Code
 509 };
 510 \end_layout
 511
 512 \begin_layout LyX-Code
 513 struct tdb_attribute_base {
 514 \end_layout
 515
 516 \begin_layout LyX-Code
 517     enum tdb_attribute attr;
 518 \end_layout
 519
 520 \begin_layout LyX-Code
 521     union tdb_attribute *next;
 522 \end_layout
 523
 524 \begin_layout LyX-Code
 525 };
 526 \end_layout
 527
 528 \begin_layout LyX-Code
 529 struct tdb_attribute_log {
 530 \end_layout
 531
 532 \begin_layout LyX-Code
 533     struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_LOG */
 534 \end_layout
 535
 536 \begin_layout LyX-Code
 537     tdb_log_func log_fn;
 538 \end_layout
 539
 540 \begin_layout LyX-Code
 541     void *log_private;
 542 \end_layout
 543
 544 \begin_layout LyX-Code
 545 };
 546 \end_layout
 547
 548 \begin_layout LyX-Code
 549 struct tdb_attribute_hash {
 550 \end_layout
 551
 552 \begin_layout LyX-Code
 553     struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_HASH */
 554 \end_layout
 555
 556 \begin_layout LyX-Code
 557     tdb_hash_func hash_fn;
 558 \end_layout
 559
 560 \begin_layout LyX-Code
 561     void *hash_private;
 562 \end_layout
 563
 564 \begin_layout LyX-Code
 565 };
 566 \end_layout
 567
 568 \begin_layout LyX-Code
 569 union tdb_attribute {
 570 \end_layout
 571
 572 \begin_layout LyX-Code
 573     struct tdb_attribute_base base;
 574 \end_layout
 575
 576 \begin_layout LyX-Code
 577     struct tdb_attribute_log log;
 578 \end_layout
 579
 580 \begin_layout LyX-Code
 581     struct tdb_attribute_hash hash;
 582 \end_layout
 583
 584 \begin_layout LyX-Code
 585 };
 586 \end_layout
 587
 588 \begin_layout Standard
 589 This allows future attributes to be added, even if this expands the size
 590  of the union.
 591 \end_layout
 592
 593 \begin_layout Subsection
 594 tdb_traverse Makes Impossible Guarantees
 595 \end_layout
 596
 597 \begin_layout Standard
 598 tdb_traverse (and tdb_firstkey/tdb_nextkey) predate transactions, and it
 599  was thought that it was important to guarantee that all records which exist
 600  at the start and end of the traversal would be included, and no record
 601  would be included twice.
 602 \end_layout
 603
 604 \begin_layout Standard
 605 This adds complexity (see
 606 \begin_inset CommandInset ref
 607 LatexCommand ref
 608 reference "Reliable-Traversal-Adds"
 609
 610 \end_inset
 611
 612 ) and does not work anyway for records which are altered (in particular,
 613  those which are expanded may be effectively deleted and re-added behind
 614  the traversal).
 615 \end_layout
 616
 617 \begin_layout Subsubsection
 618 \begin_inset CommandInset label
 619 LatexCommand label
 620 name "traverse-Proposed-Solution"
 621
 622 \end_inset
 623
 624 Proposed Solution
 625 \end_layout
 626
 627 \begin_layout Standard
 628 Abandon the guarantee.
 629  You will see every record if no changes occur during your traversal, otherwise
 630  you will see some subset.
 631  You can prevent changes by using a transaction or the locking API.
 632 \end_layout
 633
 634 \begin_layout Subsection
 635 Nesting of Transactions Is Fraught
 636 \end_layout
 637
 638 \begin_layout Standard
 639 TDB has alternated between allowing nested transactions and not allowing
 640  them.
 641  Various paths in the Samba codebase assume that transactions will nest,
 642  and in a sense they can: the operation is only committed to disk when the
 643  outer transaction is committed.
 644  There are two problems, however:
 645 \end_layout
 646
 647 \begin_layout Enumerate
 648 Canceling the inner transaction will cause the outer transaction commit
 649  to fail, and will not undo any operations since the inner transaction began.
 650  This problem is soluble with some additional internal code.
 651 \end_layout
 652
 653 \begin_layout Enumerate
 654 An inner transaction commit can be cancelled by the outer transaction.
 655  This is desirable in the way which Samba's database initialization code
 656  uses transactions, but could be a surprise to any users expecting a successful
 657  transaction commit to expose changes to others.
 658 \end_layout
 659
 660 \begin_layout Standard
 661 The current solution is to specify the behavior at tdb_open(), with the
 662  default currently that nested transactions are allowed.
 663  This flag can also be changed at runtime.
 664 \end_layout
 665
 666 \begin_layout Subsubsection
 667 Proposed Solution
 668 \end_layout
 669
 670 \begin_layout Standard
 671 Given the usage patterns, it seems that the
 672 \begin_inset Quotes eld
 673 \end_inset
 674
 675 least-surprise
 676 \begin_inset Quotes erd
 677 \end_inset
 678
 679  behavior of disallowing nested transactions should become the default.
 680  Additionally, it seems the outer transaction is the only code which knows
 681  whether inner transactions should be allowed, so a flag to indicate this
 682  could be added to tdb_transaction_start.
 683  However, this behavior can be simulated with a wrapper which uses tdb_add_flags
 684 () and tdb_remove_flags(), so the API should not be expanded for this relatively
 685 -obscure case.
 686 \end_layout
 687
 688 \begin_layout Subsection
 689 Incorrect Hash Function is Not Detected
 690 \end_layout
 691
 692 \begin_layout Standard
 693 tdb_open_ex() allows the calling code to specify a different hash function
 694  to use, but does not check that all other processes accessing this tdb
 695  are using the same hash function.
 696  The result is that records are missing from tdb_fetch().
 697 \end_layout
 698
 699 \begin_layout Subsubsection
 700 Proposed Solution
 701 \end_layout
 702
 703 \begin_layout Standard
 704 The header should contain an example hash result (eg.
 705  the hash of 0xdeadbeef), and tdb_open_ex() should check that the given
 706  hash function produces the same answer, or fail the tdb_open call.
 707 \end_layout
 708
 709 \begin_layout Subsection
 710 tdb_set_max_dead/TDB_VOLATILE Expose Implementation
 711 \end_layout
 712
 713 \begin_layout Standard
 714 In response to scalability issues with the free list (
 715 \begin_inset CommandInset ref
 716 LatexCommand ref
 717 reference "TDB-Freelist-Is"
 718
 719 \end_inset
 720
 721 ) two API workarounds have been incorporated in TDB: tdb_set_max_dead()
 722  and the TDB_VOLATILE flag to tdb_open.
 723  The latter actually calls the former with an argument of
 724 \begin_inset Quotes eld
 725 \end_inset
 726
 727 5
 728 \begin_inset Quotes erd
 729 \end_inset
 730
 731 .
 732 \end_layout
 733
 734 \begin_layout Standard
 735 This code allows deleted records to accumulate without putting them in the
 736  free list.
 737  On delete we iterate through each chain and free them in a batch if there
 738  are more than max_dead entries.
 739  These are never otherwise recycled except as a side-effect of a tdb_repack.
 740 \end_layout
 741
 742 \begin_layout Subsubsection
 743 Proposed Solution
 744 \end_layout
 745
 746 \begin_layout Standard
 747 With the scalability problems of the freelist solved, this API can be removed.
 748  The TDB_VOLATILE flag may still be useful as a hint that store and delete
 749  of records will be at least as common as fetch in order to allow some internal
 750  tuning, but initially will become a no-op.
 751 \end_layout
 752
 753 \begin_layout Subsection
 754 \begin_inset CommandInset label
 755 LatexCommand label
 756 name "TDB-Files-Cannot"
 757
 758 \end_inset
 759
 760 TDB Files Cannot Be Opened Multiple Times In The Same Process
 761 \end_layout
 762
 763 \begin_layout Standard
 764 No process can open the same TDB twice; we check and disallow it.
 765  This is an unfortunate side-effect of fcntl locks, which operate on a per-file
 766  rather than per-file-descriptor basis, and do not nest.
 767  Thus, closing any file descriptor on a file clears all the locks obtained
 768  by this process, even if they were placed using a different file descriptor!
 769 \end_layout
 770
 771 \begin_layout Standard
 772 Note that even if this were solved, deadlock could occur if operations were
 773  nested: this is a more manageable programming error in most cases.
 774 \end_layout
 775
 776 \begin_layout Subsubsection
 777 Proposed Solution
 778 \end_layout
 779
 780 \begin_layout Standard
 781 We could lobby POSIX to fix the perverse rules, or at least lobby Linux
 782  to violate them so that the most common implementation does not have this
 783  restriction.
 784  This would be a generally good idea for other fcntl lock users.
 785 \end_layout
 786
 787 \begin_layout Standard
 788 Samba uses a wrapper which hands out the same tdb_context to multiple callers
 789  if this happens, and does simple reference counting.
 790  We should do this inside the tdb library, which already emulates lock nesting
 791  internally; it would need to recognize when deadlock occurs within a single
 792  process.
 793  This would create a new failure mode for tdb operations (while we currently
 794  handle locking failures, they are impossible in normal use and a process
 795  encountering them can do little but give up).
 796 \end_layout
 797
 798 \begin_layout Standard
 799 I do not see benefit in an additional tdb_open flag to indicate whether
 800  re-opening is allowed, as though there may be some benefit to adding a
 801  call to detect when a tdb_context is shared, to allow other to create such
 802  an API.
 803 \end_layout
 804
 805 \begin_layout Subsection
 806 TDB API Is Not POSIX Thread-safe
 807 \end_layout
 808
 809 \begin_layout Standard
 810 The TDB API uses an error code which can be queried after an operation to
 811  determine what went wrong.
 812  This programming model does not work with threads, unless specific additional
 813  guarantees are given by the implementation.
 814  In addition, even otherwise-independent threads cannot open the same TDB
 815  (as in
 816 \begin_inset CommandInset ref
 817 LatexCommand ref
 818 reference "TDB-Files-Cannot"
 819
 820 \end_inset
 821
 822 ).
 823 \end_layout
 824
 825 \begin_layout Subsubsection
 826 Proposed Solution
 827 \end_layout
 828
 829 \begin_layout Standard
 830 Reachitecting the API to include a tdb_errcode pointer would be a great
 831  deal of churn; we are better to guarantee that the tdb_errcode is per-thread
 832  so the current programming model can be maintained.
 833 \end_layout
 834
 835 \begin_layout Standard
 836 This requires dynamic per-thread allocations, which is awkward with POSIX
 837  threads (pthread_key_create space is limited and we cannot simply allocate
 838  a key for every TDB).
 839 \end_layout
 840
 841 \begin_layout Standard
 842 Internal locking is required to make sure that fcntl locks do not overlap
 843  between threads, and also that the global list of tdbs is maintained.
 844 \end_layout
 845
 846 \begin_layout Standard
 847 The aim is that building tdb with -DTDB_PTHREAD will result in a pthread-safe
 848  version of the library, and otherwise no overhead will exist.
 849
 850 \change_inserted 0 1284016998
 851  Alternatively, a hooking mechanism similar to that proposed for
 852 \begin_inset CommandInset ref
 853 LatexCommand ref
 854 reference "Proposed-Solution-locking-hook"
 855
 856 \end_inset
 857
 858  could be used to enable pthread locking at runtime.
 859 \change_unchanged
 860
 861 \end_layout
 862
 863 \begin_layout Subsection
 864 *_nonblock Functions And *_mark Functions Expose Implementation
 865 \end_layout
 866
 867 \begin_layout Standard
 868 CTDB
 869 \begin_inset Foot
 870 status collapsed
 871
 872 \begin_layout Plain Layout
 873 Clustered TDB, see http://ctdb.samba.org
 874 \end_layout
 875
 876 \end_inset
 877
 878  wishes to operate on TDB in a non-blocking manner.
 879  This is currently done as follows:
 880 \end_layout
 881
 882 \begin_layout Enumerate
 883 Call the _nonblock variant of an API function (eg.
 884  tdb_lockall_nonblock).
 885  If this fails:
 886 \end_layout
 887
 888 \begin_layout Enumerate
 889 Fork a child process, and wait for it to call the normal variant (eg.
 890  tdb_lockall).
 891 \end_layout
 892
 893 \begin_layout Enumerate
 894 If the child succeeds, call the _mark variant to indicate we already have
 895  the locks (eg.
 896  tdb_lockall_mark).
 897 \end_layout
 898
 899 \begin_layout Enumerate
 900 Upon completion, tell the child to release the locks (eg.
 901  tdb_unlockall).
 902 \end_layout
 903
 904 \begin_layout Enumerate
 905 Indicate to tdb that it should consider the locks removed (eg.
 906  tdb_unlockall_mark).
 907 \end_layout
 908
 909 \begin_layout Standard
 910 There are several issues with this approach.
 911  Firstly, adding two new variants of each function clutters the API for
 912  an obscure use, and so not all functions have three variants.
 913  Secondly, it assumes that all paths of the functions ask for the same locks,
 914  otherwise the parent process will have to get a lock which the child doesn't
 915  have under some circumstances.
 916  I don't believe this is currently the case, but it constrains the implementatio
 917 n.
 918
 919 \end_layout
 920
 921 \begin_layout Subsubsection
 922 \begin_inset CommandInset label
 923 LatexCommand label
 924 name "Proposed-Solution-locking-hook"
 925
 926 \end_inset
 927
 928 Proposed Solution
 929 \end_layout
 930
 931 \begin_layout Standard
 932 Implement a hook for locking methods, so that the caller can control the
 933  calls to create and remove fcntl locks.
 934  In this scenario, ctdbd would operate as follows:
 935 \end_layout
 936
 937 \begin_layout Enumerate
 938 Call the normal API function, eg tdb_lockall().
 939 \end_layout
 940
 941 \begin_layout Enumerate
 942 When the lock callback comes in, check if the child has the lock.
 943  Initially, this is always false.
 944  If so, return 0.
 945  Otherwise, try to obtain it in non-blocking mode.
 946  If that fails, return EWOULDBLOCK.
 947 \end_layout
 948
 949 \begin_layout Enumerate
 950 Release locks in the unlock callback as normal.
 951 \end_layout
 952
 953 \begin_layout Enumerate
 954 If tdb_lockall() fails, see if we recorded a lock failure; if so, call the
 955  child to repeat the operation.
 956 \end_layout
 957
 958 \begin_layout Enumerate
 959 The child records what locks it obtains, and returns that information to
 960  the parent.
 961 \end_layout
 962
 963 \begin_layout Enumerate
 964 When the child has succeeded, goto 1.
 965 \end_layout
 966
 967 \begin_layout Standard
 968 This is flexible enough to handle any potential locking scenario, even when
 969  lock requirements change.
 970  It can be optimized so that the parent does not release locks, just tells
 971  the child which locks it doesn't need to obtain.
 972 \end_layout
 973
 974 \begin_layout Standard
 975 It also keeps the complexity out of the API, and in ctdbd where it is needed.
 976 \end_layout
 977
 978 \begin_layout Subsection
 979 tdb_chainlock Functions Expose Implementation
 980 \end_layout
 981
 982 \begin_layout Standard
 983 tdb_chainlock locks some number of records, including the record indicated
 984  by the given key.
 985  This gave atomicity guarantees; no-one can start a transaction, alter,
 986  read or delete that key while the lock is held.
 987 \end_layout
 988
 989 \begin_layout Standard
 990 It also makes the same guarantee for any other key in the chain, which is
 991  an internal implementation detail and potentially a cause for deadlock.
 992 \end_layout
 993
 994 \begin_layout Subsubsection
 995 Proposed Solution
 996 \end_layout
 997
 998 \begin_layout Standard
 999 None.
1000  It would be nice to have an explicit single entry lock which effected no
1001  other keys.
1002  Unfortunately, this won't work for an entry which doesn't exist.
1003  Thus while chainlock may be implemented more efficiently for the existing
1004  case, it will still have overlap issues with the non-existing case.
1005  So it is best to keep the current (lack of) guarantee about which records
1006  will be effected to avoid constraining our implementation.
1007 \end_layout
1008
1009 \begin_layout Subsection
1010 Signal Handling is Not Race-Free
1011 \end_layout
1012
1013 \begin_layout Standard
1014 The tdb_setalarm_sigptr() call allows the caller's signal handler to indicate
1015  that the tdb locking code should return with a failure, rather than trying
1016  again when a signal is received (and errno == EAGAIN).
1017  This is usually used to implement timeouts.
1018 \end_layout
1019
1020 \begin_layout Standard
1021 Unfortunately, this does not work in the case where the signal is received
1022  before the tdb code enters the fcntl() call to place the lock: the code
1023  will sleep within the fcntl() code, unaware that the signal wants it to
1024  exit.
1025  In the case of long timeouts, this does not happen in practice.
1026 \end_layout
1027
1028 \begin_layout Subsubsection
1029 Proposed Solution
1030 \end_layout
1031
1032 \begin_layout Standard
1033 The locking hooks proposed in
1034 \begin_inset CommandInset ref
1035 LatexCommand ref
1036 reference "Proposed-Solution-locking-hook"
1037
1038 \end_inset
1039
1040  would allow the user to decide on whether to fail the lock acquisition
1041  on a signal.
1042  This allows the caller to choose their own compromise: they could narrow
1043  the race by checking immediately before the fcntl call.
1044 \begin_inset Foot
1045 status collapsed
1046
1047 \begin_layout Plain Layout
1048 It may be possible to make this race-free in some implementations by having
1049  the signal handler alter the struct flock to make it invalid.
1050  This will cause the fcntl() lock call to fail with EINVAL if the signal
1051  occurs before the kernel is entered, otherwise EAGAIN.
1052 \end_layout
1053
1054 \end_inset
1055
1056
1057 \end_layout
1058
1059 \begin_layout Subsection
1060 The API Uses Gratuitous Typedefs, Capitals
1061 \end_layout
1062
1063 \begin_layout Standard
1064 typedefs are useful for providing source compatibility when types can differ
1065  across implementations, or arguably in the case of function pointer definitions
1066  which are hard for humans to parse.
1067  Otherwise it is simply obfuscation and pollutes the namespace.
1068 \end_layout
1069
1070 \begin_layout Standard
1071 Capitalization is usually reserved for compile-time constants and macros.
1072 \end_layout
1073
1074 \begin_layout Description
1075 TDB_CONTEXT There is no reason to use this over 'struct tdb_context'; the
1076  definition isn't visible to the API user anyway.
1077 \end_layout
1078
1079 \begin_layout Description
1080 TDB_DATA There is no reason to use this over struct TDB_DATA; the struct
1081  needs to be understood by the API user.
1082 \end_layout
1083
1084 \begin_layout Description
1085 struct
1086 \begin_inset space ~
1087 \end_inset
1088
1089 TDB_DATA This would normally be called 'struct tdb_data'.
1090 \end_layout
1091
1092 \begin_layout Description
1093 enum
1094 \begin_inset space ~
1095 \end_inset
1096
1097 TDB_ERROR Similarly, this would normally be enum tdb_error.
1098 \end_layout
1099
1100 \begin_layout Subsubsection
1101 Proposed Solution
1102 \end_layout
1103
1104 \begin_layout Standard
1105 None.
1106  Introducing lower case variants would please pedants like myself, but if
1107  it were done the existing ones should be kept.
1108  There is little point forcing a purely cosmetic change upon tdb users.
1109 \end_layout
1110
1111 \begin_layout Subsection
1112 \begin_inset CommandInset label
1113 LatexCommand label
1114 name "tdb_log_func-Doesnt-Take"
1115
1116 \end_inset
1117
1118 tdb_log_func Doesn't Take The Private Pointer
1119 \end_layout
1120
1121 \begin_layout Standard
1122 For API compatibility reasons, the logging function needs to call tdb_get_loggin
1123 g_private() to retrieve the pointer registered by the tdb_open_ex for logging.
1124 \end_layout
1125
1126 \begin_layout Subsubsection
1127 Proposed Solution
1128 \end_layout
1129
1130 \begin_layout Standard
1131 It should simply take an extra argument, since we are prepared to break
1132  the API/ABI.
1133 \end_layout
1134
1135 \begin_layout Subsection
1136 Various Callback Functions Are Not Typesafe
1137 \end_layout
1138
1139 \begin_layout Standard
1140 The callback functions in tdb_set_logging_function (after
1141 \begin_inset CommandInset ref
1142 LatexCommand ref
1143 reference "tdb_log_func-Doesnt-Take"
1144
1145 \end_inset
1146
1147  is resolved), tdb_parse_record, tdb_traverse, tdb_traverse_read and tdb_check
1148  all take void * and must internally convert it to the argument type they
1149  were expecting.
1150 \end_layout
1151
1152 \begin_layout Standard
1153 If this type changes, the compiler will not produce warnings on the callers,
1154  since it only sees void *.
1155 \end_layout
1156
1157 \begin_layout Subsubsection
1158 Proposed Solution
1159 \end_layout
1160
1161 \begin_layout Standard
1162 With careful use of macros, we can create callback functions which give
1163  a warning when used on gcc and the types of the callback and its private
1164  argument differ.
1165  Unsupported compilers will not give a warning, which is no worse than now.
1166  In addition, the callbacks become clearer, as they need not use void *
1167  for their parameter.
1168 \end_layout
1169
1170 \begin_layout Standard
1171 See CCAN's typesafe_cb module at http://ccan.ozlabs.org/info/typesafe_cb.html
1172 \end_layout
1173
1174 \begin_layout Subsection
1175 TDB_CLEAR_IF_FIRST Must Be Specified On All Opens, tdb_reopen_all Problematic
1176 \end_layout
1177
1178 \begin_layout Standard
1179 The TDB_CLEAR_IF_FIRST flag to tdb_open indicates that the TDB file should
1180  be cleared if the caller discovers it is the only process with the TDB
1181  open.
1182  However, if any caller does not specify TDB_CLEAR_IF_FIRST it will not
1183  be detected, so will have the TDB erased underneath them (usually resulting
1184  in a crash).
1185 \end_layout
1186
1187 \begin_layout Standard
1188 There is a similar issue on fork(); if the parent exits (or otherwise closes
1189  the tdb) before the child calls tdb_reopen_all() to establish the lock
1190  used to indicate the TDB is opened by someone, a TDB_CLEAR_IF_FIRST opener
1191  at that moment will believe it alone has opened the TDB and will erase
1192  it.
1193 \end_layout
1194
1195 \begin_layout Subsubsection
1196 Proposed Solution
1197 \end_layout
1198
1199 \begin_layout Standard
1200 Remove TDB_CLEAR_IF_FIRST.
1201  Other workarounds are possible, but see
1202 \begin_inset CommandInset ref
1203 LatexCommand ref
1204 reference "TDB_CLEAR_IF_FIRST-Imposes-Performance"
1205
1206 \end_inset
1207
1208 .
1209 \change_inserted 0 1284015637
1210
1211 \end_layout
1212
1213 \begin_layout Subsection
1214
1215 \change_inserted 0 1284015716
1216 Extending The Header Is Difficult
1217 \end_layout
1218
1219 \begin_layout Standard
1220
1221 \change_inserted 0 1284015906
1222 We have reserved (zeroed) words in the TDB header, which can be used for
1223  future features.
1224  If the future features are compulsory, the version number must be updated
1225  to prevent old code from accessing the database.
1226  But if the future feature is optional, we have no way of telling if older
1227  code is accessing the database or not.
1228 \end_layout
1229
1230 \begin_layout Subsubsection
1231
1232 \change_inserted 0 1284015637
1233 Proposed Solution
1234 \end_layout
1235
1236 \begin_layout Standard
1237
1238 \change_inserted 0 1284016114
1239 The header should contain a
1240 \begin_inset Quotes eld
1241 \end_inset
1242
1243 format variant
1244 \begin_inset Quotes erd
1245 \end_inset
1246
1247  value (64-bit).
1248  This is divided into two 32-bit parts:
1249 \end_layout
1250
1251 \begin_layout Enumerate
1252
1253 \change_inserted 0 1284016149
1254 The lower part reflects the format variant understood by code accessing
1255  the database.
1256 \end_layout
1257
1258 \begin_layout Enumerate
1259
1260 \change_inserted 0 1284016639
1261 The upper part reflects the format variant you must understand to write
1262  to the database (otherwise you can only open for reading).
1263 \end_layout
1264
1265 \begin_layout Standard
1266
1267 \change_inserted 0 1284016821
1268 The latter field can only be written at creation time, the former should
1269  be written under the OPEN_LOCK when opening the database for writing, if
1270  the variant of the code is lower than the current lowest variant.
1271 \end_layout
1272
1273 \begin_layout Standard
1274
1275 \change_inserted 0 1284016803
1276 This should allow backwards-compatible features to be added, and detection
1277  if older code (which doesn't understand the feature) writes to the database.
1278 \change_deleted 0 1284016101
1279
1280 \end_layout
1281
1282 \begin_layout Subsection
1283
1284 \change_inserted 0 1284015634
1285 Record Headers Are Not Expandible
1286 \end_layout
1287
1288 \begin_layout Standard
1289
1290 \change_inserted 0 1284015634
1291 If we later want to add (say) checksums on keys and data, it would require
1292  another format change, which we'd like to avoid.
1293 \end_layout
1294
1295 \begin_layout Subsubsection
1296
1297 \change_inserted 0 1284015634
1298 Proposed Solution
1299 \end_layout
1300
1301 \begin_layout Standard
1302
1303 \change_inserted 0 1284422552
1304 We often have extra padding at the tail of a record.
1305  If we ensure that the first byte (if any) of this padding is zero, we will
1306  have a way for future changes to detect code which doesn't understand a
1307  new format: the new code would write (say) a 1 at the tail, and thus if
1308  there is no tail or the first byte is 0, we would know the extension is
1309  not present on that record.
1310 \end_layout
1311
1312 \begin_layout Subsection
1313
1314 \change_inserted 0 1284422568
1315 TDB Does Not Use Talloc
1316 \end_layout
1317
1318 \begin_layout Standard
1319
1320 \change_inserted 0 1284422646
1321 Many users of TDB (particularly Samba) use the talloc allocator, and thus
1322  have to wrap TDB in a talloc context to use it conveniently.
1323 \end_layout
1324
1325 \begin_layout Subsubsection
1326
1327 \change_inserted 0 1284422656
1328 Proposed Solution
1329 \end_layout
1330
1331 \begin_layout Standard
1332
1333 \change_inserted 0 1284423065
1334 The allocation within TDB is not complicated enough to justify the use of
1335  talloc, and I am reluctant to force another (excellent) library on TDB
1336  users.
1337  Nonetheless a compromise is possible.
1338  An attribute (see
1339 \begin_inset CommandInset ref
1340 LatexCommand ref
1341 reference "attributes"
1342
1343 \end_inset
1344
1345 ) can be added later to tdb_open() to provide an alternate allocation mechanism,
1346  specifically for talloc but usable by any other allocator (which would
1347  ignore the
1348 \begin_inset Quotes eld
1349 \end_inset
1350
1351 context
1352 \begin_inset Quotes erd
1353 \end_inset
1354
1355  argument).
1356 \end_layout
1357
1358 \begin_layout Standard
1359
1360 \change_inserted 0 1284423042
1361 This would form a talloc heirarchy as expected, but the caller would still
1362  have to attach a destructor to the tdb context returned from tdb_open to
1363  close it.
1364  All TDB_DATA fields would be children of the tdb_context, and the caller
1365  would still have to manage them (using talloc_free() or talloc_steal()).
1366 \change_unchanged
1367
1368 \end_layout
1369
1370 \begin_layout Section
1371 Performance And Scalability Issues
1372 \end_layout
1373
1374 \begin_layout Subsection
1375 \begin_inset CommandInset label
1376 LatexCommand label
1377 name "TDB_CLEAR_IF_FIRST-Imposes-Performance"
1378
1379 \end_inset
1380
1381 TDB_CLEAR_IF_FIRST Imposes Performance Penalty
1382 \end_layout
1383
1384 \begin_layout Standard
1385 When TDB_CLEAR_IF_FIRST is specified, a 1-byte read lock is placed at offset
1386  4 (aka.
1387  the ACTIVE_LOCK).
1388  While these locks never conflict in normal tdb usage, they do add substantial
1389  overhead for most fcntl lock implementations when the kernel scans to detect
1390  if a lock conflict exists.
1391  This is often a single linked list, making the time to acquire and release
1392  a fcntl lock O(N) where N is the number of processes with the TDB open,
1393  not the number actually doing work.
1394 \end_layout
1395
1396 \begin_layout Standard
1397 In a Samba server it is common to have huge numbers of clients sitting idle,
1398  and thus they have weaned themselves off the TDB_CLEAR_IF_FIRST flag.
1399 \begin_inset Foot
1400 status collapsed
1401
1402 \begin_layout Plain Layout
1403 There is a flag to tdb_reopen_all() which is used for this optimization:
1404  if the parent process will outlive the child, the child does not need the
1405  ACTIVE_LOCK.
1406  This is a workaround for this very performance issue.
1407 \end_layout
1408
1409 \end_inset
1410
1411
1412 \end_layout
1413
1414 \begin_layout Subsubsection
1415 Proposed Solution
1416 \end_layout
1417
1418 \begin_layout Standard
1419 Remove the flag.
1420  It was a neat idea, but even trivial servers tend to know when they are
1421  initializing for the first time and can simply unlink the old tdb at that
1422  point.
1423 \end_layout
1424
1425 \begin_layout Subsection
1426 TDB Files Have a 4G Limit
1427 \end_layout
1428
1429 \begin_layout Standard
1430 This seems to be becoming an issue (so much for
1431 \begin_inset Quotes eld
1432 \end_inset
1433
1434 trivial
1435 \begin_inset Quotes erd
1436 \end_inset
1437
1438 !), particularly for ldb.
1439 \end_layout
1440
1441 \begin_layout Subsubsection
1442 Proposed Solution
1443 \end_layout
1444
1445 \begin_layout Standard
1446 A new, incompatible TDB format which uses 64 bit offsets internally rather
1447  than 32 bit as now.
1448  For simplicity of endian conversion (which TDB does on the fly if required),
1449  all values will be 64 bit on disk.
1450  In practice, some upper bits may be used for other purposes, but at least
1451  56 bits will be available for file offsets.
1452 \end_layout
1453
1454 \begin_layout Standard
1455 tdb_open() will automatically detect the old version, and even create them
1456  if TDB_VERSION6 is specified to tdb_open.
1457 \end_layout
1458
1459 \begin_layout Standard
1460 32 bit processes will still be able to access TDBs larger than 4G (assuming
1461  that their off_t allows them to seek to 64 bits), they will gracefully
1462  fall back as they fail to mmap.
1463  This can happen already with large TDBs.
1464 \end_layout
1465
1466 \begin_layout Standard
1467 Old versions of tdb will fail to open the new TDB files (since 28 August
1468  2009, commit 398d0c29290: prior to that any unrecognized file format would
1469  be erased and initialized as a fresh tdb!)
1470 \end_layout
1471
1472 \begin_layout Subsection
1473 TDB Records Have a 4G Limit
1474 \end_layout
1475
1476 \begin_layout Standard
1477 This has not been a reported problem, and the API uses size_t which can
1478  be 64 bit on 64 bit platforms.
1479  However, other limits may have made such an issue moot.
1480 \end_layout
1481
1482 \begin_layout Subsubsection
1483 Proposed Solution
1484 \end_layout
1485
1486 \begin_layout Standard
1487 Record sizes will be 64 bit, with an error returned on 32 bit platforms
1488  which try to access such records (the current implementation would return
1489  TDB_ERR_OOM in a similar case).
1490  It seems unlikely that 32 bit keys will be a limitation, so the implementation
1491  may not support this (see
1492 \begin_inset CommandInset ref
1493 LatexCommand ref
1494 reference "sub:Records-Incur-A"
1495
1496 \end_inset
1497
1498 ).
1499 \end_layout
1500
1501 \begin_layout Subsection
1502 Hash Size Is Determined At TDB Creation Time
1503 \end_layout
1504
1505 \begin_layout Standard
1506 TDB contains a number of hash chains in the header; the number is specified
1507  at creation time, and defaults to 131.
1508  This is such a bottleneck on large databases (as each hash chain gets quite
1509  long), that LDB uses 10,000 for this hash.
1510  In general it is impossible to know what the 'right' answer is at database
1511  creation time.
1512 \end_layout
1513
1514 \begin_layout Subsubsection
1515
1516 \change_inserted 0 1283336713
1517 \begin_inset CommandInset label
1518 LatexCommand label
1519 name "sub:Hash-Size-Solution"
1520
1521 \end_inset
1522
1523
1524 \change_unchanged
1525 Proposed Solution
1526 \end_layout
1527
1528 \begin_layout Standard
1529 After comprehensive performance testing on various scalable hash variants
1530 \begin_inset Foot
1531 status collapsed
1532
1533 \begin_layout Plain Layout
1534 http://rusty.ozlabs.org/?p=89 and http://rusty.ozlabs.org/?p=94 This was annoying
1535  because I was previously convinced that an expanding tree of hashes would
1536  be very close to optimal.
1537 \end_layout
1538
1539 \end_inset
1540
1541 , it became clear that it is hard to beat a straight linear hash table which
1542  doubles in size when it reaches saturation.
1543
1544 \change_deleted 0 1283307675
1545 There are three details which become important:
1546 \end_layout
1547
1548 \begin_layout Enumerate
1549
1550 \change_deleted 0 1283307675
1551 On encountering a full bucket, we use the next bucket.
1552 \end_layout
1553
1554 \begin_layout Enumerate
1555
1556 \change_deleted 0 1283307675
1557 Extra hash bits are stored with the offset, to reduce comparisons.
1558 \end_layout
1559
1560 \begin_layout Enumerate
1561
1562 \change_deleted 0 1283307675
1563 A marker entry is used on deleting an entry.
1564 \end_layout
1565
1566 \begin_layout Standard
1567
1568 \change_deleted 0 1283307675
1569 The doubling of the table must be done under a transaction; we will not
1570  reduce it on deletion, so it will be an unusual case.
1571  It will either be placed at the head (other entries will be moved out the
1572  way so we can expand).
1573  We could have a pointer in the header to the current hashtable location,
1574  but that pointer would have to be read frequently to check for hashtable
1575  moves.
1576 \end_layout
1577
1578 \begin_layout Standard
1579
1580 \change_deleted 0 1283307675
1581 The locking for this is slightly more complex than the chained case; we
1582  currently have one lock per bucket, and that means we would need to expand
1583  the lock if we overflow to the next bucket.
1584  The frequency of such collisions will effect our locking heuristics: we
1585  can always lock more buckets than we need.
1586 \end_layout
1587
1588 \begin_layout Standard
1589
1590 \change_deleted 0 1283307675
1591 One possible optimization is to only re-check the hash size on an insert
1592  or a lookup miss.
1593
1594 \change_inserted 0 1283307770
1595  Unfortunately, altering the hash table introduces serious locking complications
1596 : the entire hash table needs to be locked to enlarge the hash table, and
1597  others might be holding locks.
1598  Particularly insidious are insertions done under tdb_chainlock.
1599 \end_layout
1600
1601 \begin_layout Standard
1602
1603 \change_inserted 0 1283336187
1604 Thus an expanding layered hash will be used: an array of hash groups, with
1605  each hash group exploding into pointers to lower hash groups once it fills,
1606  turning into a hash tree.
1607  This has implications for locking: we must lock the entire group in case
1608  we need to expand it, yet we don't know how deep the tree is at that point.
1609 \end_layout
1610
1611 \begin_layout Standard
1612
1613 \change_inserted 0 1283336586
1614 Note that bits from the hash table entries should be stolen to hold more
1615  hash bits to reduce the penalty of collisions.
1616  We can use the otherwise-unused lower 3 bits.
1617  If we limit the size of the database to 64 exabytes, we can use the top
1618  8 bits of the hash entry as well.
1619  These 11 bits would reduce false positives down to 1 in 2000 which is more
1620  than we need: we can use one of the bits to indicate that the extra hash
1621  bits are valid.
1622  This means we can choose not to re-hash all entries when we expand a hash
1623  group; simply use the next bits we need and mark them invalid.
1624 \change_unchanged
1625
1626 \end_layout
1627
1628 \begin_layout Subsection
1629 \begin_inset CommandInset label
1630 LatexCommand label
1631 name "TDB-Freelist-Is"
1632
1633 \end_inset
1634
1635 TDB Freelist Is Highly Contended
1636 \end_layout
1637
1638 \begin_layout Standard
1639 TDB uses a single linked list for the free list.
1640  Allocation occurs as follows, using heuristics which have evolved over
1641  time:
1642 \end_layout
1643
1644 \begin_layout Enumerate
1645 Get the free list lock for this whole operation.
1646 \end_layout
1647
1648 \begin_layout Enumerate
1649 Multiply length by 1.25, so we always over-allocate by 25%.
1650 \end_layout
1651
1652 \begin_layout Enumerate
1653 Set the slack multiplier to 1.
1654 \end_layout
1655
1656 \begin_layout Enumerate
1657 Examine the current freelist entry: if it is > length but < the current
1658  best case, remember it as the best case.
1659 \end_layout
1660
1661 \begin_layout Enumerate
1662 Multiply the slack multiplier by 1.05.
1663 \end_layout
1664
1665 \begin_layout Enumerate
1666 If our best fit so far is less than length * slack multiplier, return it.
1667  The slack will be turned into a new free record if it's large enough.
1668 \end_layout
1669
1670 \begin_layout Enumerate
1671 Otherwise, go onto the next freelist entry.
1672 \end_layout
1673
1674 \begin_layout Standard
1675 Deleting a record occurs as follows:
1676 \end_layout
1677
1678 \begin_layout Enumerate
1679 Lock the hash chain for this whole operation.
1680 \end_layout
1681
1682 \begin_layout Enumerate
1683 Walk the chain to find the record, keeping the prev pointer offset.
1684 \end_layout
1685
1686 \begin_layout Enumerate
1687 If max_dead is non-zero:
1688 \end_layout
1689
1690 \begin_deeper
1691 \begin_layout Enumerate
1692 Walk the hash chain again and count the dead records.
1693 \end_layout
1694
1695 \begin_layout Enumerate
1696 If it's more than max_dead, bulk free all the dead ones (similar to steps
1697  4 and below, but the lock is only obtained once).
1698 \end_layout
1699
1700 \begin_layout Enumerate
1701 Simply mark this record as dead and return.
1702
1703 \end_layout
1704
1705 \end_deeper
1706 \begin_layout Enumerate
1707 Get the free list lock for the remainder of this operation.
1708 \end_layout
1709
1710 \begin_layout Enumerate
1711 \begin_inset CommandInset label
1712 LatexCommand label
1713 name "right-merging"
1714
1715 \end_inset
1716
1717 Examine the following block to see if it is free; if so, enlarge the current
1718  block and remove that block from the free list.
1719  This was disabled, as removal from the free list was O(entries-in-free-list).
1720 \end_layout
1721
1722 \begin_layout Enumerate
1723 Examine the preceeding block to see if it is free: for this reason, each
1724  block has a 32-bit tailer which indicates its length.
1725  If it is free, expand it to cover our new block and return.
1726 \end_layout
1727
1728 \begin_layout Enumerate
1729 Otherwise, prepend ourselves to the free list.
1730 \end_layout
1731
1732 \begin_layout Standard
1733 Disabling right-merging (step
1734 \begin_inset CommandInset ref
1735 LatexCommand ref
1736 reference "right-merging"
1737
1738 \end_inset
1739
1740 ) causes fragmentation; the other heuristics proved insufficient to address
1741  this, so the final answer to this was that when we expand the TDB file
1742  inside a transaction commit, we repack the entire tdb.
1743 \end_layout
1744
1745 \begin_layout Standard
1746 The single list lock limits our allocation rate; due to the other issues
1747  this is not currently seen as a bottleneck.
1748 \end_layout
1749
1750 \begin_layout Subsubsection
1751 Proposed Solution
1752 \change_deleted 0 1283336858
1753
1754 \end_layout
1755
1756 \begin_layout Standard
1757 The first step is to remove all the current heuristics, as they obviously
1758  interact, then examine them once the lock contention is addressed.
1759 \end_layout
1760
1761 \begin_layout Standard
1762 The free list must be split to reduce contention.
1763  Assuming perfect free merging, we can at most have 1 free list entry for
1764  each entry.
1765  This implies that the number of free lists is related to the size of the
1766  hash table, but as it is rare to walk a large number of free list entries
1767  we can use far fewer, say 1/32 of the number of hash buckets.
1768 \change_inserted 0 1283336910
1769
1770 \end_layout
1771
1772 \begin_layout Standard
1773
1774 \change_inserted 0 1283337052
1775 It seems tempting to try to reuse the hash implementation which we use for
1776  records here, but we have two ways of searching for free entries: for allocatio
1777 n we search by size (and possibly zone) which produces too many clashes
1778  for our hash table to handle well, and for coalescing we search by address.
1779  Thus an array of doubly-linked free lists seems preferable.
1780 \change_unchanged
1781
1782 \end_layout
1783
1784 \begin_layout Standard
1785 There are various benefits in using per-size free lists (see
1786 \begin_inset CommandInset ref
1787 LatexCommand ref
1788 reference "sub:TDB-Becomes-Fragmented"
1789
1790 \end_inset
1791
1792 ) but it's not clear this would reduce contention in the common case where
1793  all processes are allocating/freeing the same size.
1794  Thus we almost certainly need to divide in other ways: the most obvious
1795  is to divide the file into zones, and using a free list (or set of free
1796  lists) for each.
1797  This approximates address ordering.
1798 \end_layout
1799
1800 \begin_layout Standard
1801 Note that this means we need to split the free lists when we expand the
1802  file; this is probably acceptable when we double the hash table size, since
1803  that is such an expensive operation already.
1804  In the case of increasing the file size, there is an optimization we can
1805  use: if we use M in the formula above as the file size rounded up to the
1806  next power of 2, we only need reshuffle free lists when the file size crosses
1807  a power of 2 boundary,
1808 \emph on
1809 and
1810 \emph default
1811 reshuffling the free lists is trivial: we simply merge every consecutive
1812  pair of free lists.
1813 \end_layout
1814
1815 \begin_layout Standard
1816 The basic algorithm is as follows.
1817  Freeing is simple:
1818 \end_layout
1819
1820 \begin_layout Enumerate
1821 Identify the correct zone.
1822 \end_layout
1823
1824 \begin_layout Enumerate
1825 Lock the corresponding list.
1826 \end_layout
1827
1828 \begin_layout Enumerate
1829 Re-check the zone (we didn't have a lock, sizes could have changed): relock
1830  if necessary.
1831 \end_layout
1832
1833 \begin_layout Enumerate
1834 Place the freed entry in the list for that zone.
1835 \end_layout
1836
1837 \begin_layout Standard
1838 Allocation is a little more complicated, as we perform delayed coalescing
1839  at this point:
1840 \end_layout
1841
1842 \begin_layout Enumerate
1843 Pick a zone either the zone we last freed into, or based on a
1844 \begin_inset Quotes eld
1845 \end_inset
1846
1847 random
1848 \begin_inset Quotes erd
1849 \end_inset
1850
1851  number.
1852 \end_layout
1853
1854 \begin_layout Enumerate
1855 Lock the corresponding list.
1856 \end_layout
1857
1858 \begin_layout Enumerate
1859 Re-check the zone: relock if necessary.
1860 \end_layout
1861
1862 \begin_layout Enumerate
1863 If the top entry is -large enough, remove it from the list and return it.
1864 \end_layout
1865
1866 \begin_layout Enumerate
1867 Otherwise, coalesce entries in the list.If there was no entry large enough,
1868  unlock the list and try the next zone.
1869 \end_layout
1870
1871 \begin_layout Enumerate
1872 If no zone satisfies, expand the file.
1873 \end_layout
1874
1875 \begin_layout Standard
1876 This optimizes rapid insert/delete of free list entries by not coalescing
1877  them all the time..
1878  First-fit address ordering ordering seems to be fairly good for keeping
1879  fragmentation low (see
1880 \begin_inset CommandInset ref
1881 LatexCommand ref
1882 reference "sub:TDB-Becomes-Fragmented"
1883
1884 \end_inset
1885
1886 ).
1887  Note that address ordering does not need a tailer to coalesce, though if
1888  we needed one we could have one cheaply: see
1889 \begin_inset CommandInset ref
1890 LatexCommand ref
1891 reference "sub:Records-Incur-A"
1892
1893 \end_inset
1894
1895 .
1896
1897 \end_layout
1898
1899 \begin_layout Standard
1900 I anticipate that the number of entries in each free zone would be small,
1901  but it might be worth using one free entry to hold pointers to the others
1902  for cache efficiency.
1903 \change_inserted 0 1283309850
1904
1905 \end_layout
1906
1907 \begin_layout Standard
1908
1909 \change_inserted 0 1283337216
1910 \begin_inset CommandInset label
1911 LatexCommand label
1912 name "freelist-in-zone"
1913
1914 \end_inset
1915
1916 If we want to avoid locking complexity (enlarging the free lists when we
1917  enlarge the file) we could place the array of free lists at the beginning
1918  of each zone.
1919  This means existing array lists never move, but means that a record cannot
1920  be larger than a zone.
1921  That in turn implies that zones should be variable sized (say, power of
1922  2), which makes the question
1923 \begin_inset Quotes eld
1924 \end_inset
1925
1926 what zone is this record in?
1927 \begin_inset Quotes erd
1928 \end_inset
1929
1930  much harder (and
1931 \begin_inset Quotes eld
1932 \end_inset
1933
1934 pick a random zone
1935 \begin_inset Quotes erd
1936 \end_inset
1937
1938 , but that's less common).
1939  It could be done with as few as 4 bits from the record header.
1940 \begin_inset Foot
1941 status open
1942
1943 \begin_layout Plain Layout
1944
1945 \change_inserted 0 1284424151
1946 Using
1947 \begin_inset Formula $2^{16+N*3}$
1948 \end_inset
1949
1950 means 0 gives a minimal 65536-byte zone, 15 gives the maximal
1951 \begin_inset Formula $2^{61}$
1952 \end_inset
1953
1954  byte zone.
1955  Zones range in factor of 8 steps.
1956  Given the zone size for the zone the current record is in, we can determine
1957  the start of the zone.
1958 \change_unchanged
1959
1960 \end_layout
1961
1962 \end_inset
1963
1964
1965 \change_unchanged
1966
1967 \end_layout
1968
1969 \begin_layout Subsection
1970 \begin_inset CommandInset label
1971 LatexCommand label
1972 name "sub:TDB-Becomes-Fragmented"
1973
1974 \end_inset
1975
1976 TDB Becomes Fragmented
1977 \end_layout
1978
1979 \begin_layout Standard
1980 Much of this is a result of allocation strategy
1981 \begin_inset Foot
1982 status collapsed
1983
1984 \begin_layout Plain Layout
1985 The Memory Fragmentation Problem: Solved? Johnstone & Wilson 1995 ftp://ftp.cs.ute
1986 xas.edu/pub/garbage/malloc/ismm98.ps
1987 \end_layout
1988
1989 \end_inset
1990
1991  and deliberate hobbling of coalescing; internal fragmentation (aka overallocati
1992 on) is deliberately set at 25%, and external fragmentation is only cured
1993  by the decision to repack the entire db when a transaction commit needs
1994  to enlarge the file.
1995 \end_layout
1996
1997 \begin_layout Subsubsection
1998 Proposed Solution
1999 \end_layout
2000
2001 \begin_layout Standard
2002 The 25% overhead on allocation works in practice for ldb because indexes
2003  tend to expand by one record at a time.
2004  This internal fragmentation can be resolved by having an
2005 \begin_inset Quotes eld
2006 \end_inset
2007
2008 expanded
2009 \begin_inset Quotes erd
2010 \end_inset
2011
2012  bit in the header to note entries that have previously expanded, and allocating
2013  more space for them.
2014 \end_layout
2015
2016 \begin_layout Standard
2017 There are is a spectrum of possible solutions for external fragmentation:
2018  one is to use a fragmentation-avoiding allocation strategy such as best-fit
2019  address-order allocator.
2020  The other end of the spectrum would be to use a bump allocator (very fast
2021  and simple) and simply repack the file when we reach the end.
2022 \end_layout
2023
2024 \begin_layout Standard
2025 There are three problems with efficient fragmentation-avoiding allocators:
2026  they are non-trivial, they tend to use a single free list for each size,
2027  and there's no evidence that tdb allocation patterns will match those recorded
2028  for general allocators (though it seems likely).
2029 \end_layout
2030
2031 \begin_layout Standard
2032 Thus we don't spend too much effort on external fragmentation; we will be
2033  no worse than the current code if we need to repack on occasion.
2034  More effort is spent on reducing freelist contention, and reducing overhead.
2035 \end_layout
2036
2037 \begin_layout Subsection
2038 \begin_inset CommandInset label
2039 LatexCommand label
2040 name "sub:Records-Incur-A"
2041
2042 \end_inset
2043
2044 Records Incur A 28-Byte Overhead
2045 \end_layout
2046
2047 \begin_layout Standard
2048 Each TDB record has a header as follows:
2049 \end_layout
2050
2051 \begin_layout LyX-Code
2052 struct tdb_record {
2053 \end_layout
2054
2055 \begin_layout LyX-Code
2056         tdb_off_t next; /* offset of the next record in the list */
2057 \end_layout
2058
2059 \begin_layout LyX-Code
2060         tdb_len_t rec_len; /* total byte length of record */
2061 \end_layout
2062
2063 \begin_layout LyX-Code
2064         tdb_len_t key_len; /* byte length of key */
2065 \end_layout
2066
2067 \begin_layout LyX-Code
2068         tdb_len_t data_len; /* byte length of data */
2069 \end_layout
2070
2071 \begin_layout LyX-Code
2072         uint32_t full_hash; /* the full 32 bit hash of the key */
2073 \end_layout
2074
2075 \begin_layout LyX-Code
2076         uint32_t magic;   /* try to catch errors */
2077 \end_layout
2078
2079 \begin_layout LyX-Code
2080         /* the following union is implied:
2081 \end_layout
2082
2083 \begin_layout LyX-Code
2084                 union {
2085 \end_layout
2086
2087 \begin_layout LyX-Code
2088                         char record[rec_len];
2089 \end_layout
2090
2091 \begin_layout LyX-Code
2092                         struct {
2093 \end_layout
2094
2095 \begin_layout LyX-Code
2096                                 char key[key_len];
2097 \end_layout
2098
2099 \begin_layout LyX-Code
2100                                 char data[data_len];
2101 \end_layout
2102
2103 \begin_layout LyX-Code
2104                         }
2105 \end_layout
2106
2107 \begin_layout LyX-Code
2108                         uint32_t totalsize; (tailer)
2109 \end_layout
2110
2111 \begin_layout LyX-Code
2112                 }
2113 \end_layout
2114
2115 \begin_layout LyX-Code
2116         */
2117 \end_layout
2118
2119 \begin_layout LyX-Code
2120 };
2121 \end_layout
2122
2123 \begin_layout Standard
2124 Naively, this would double to a 56-byte overhead on a 64 bit implementation.
2125 \end_layout
2126
2127 \begin_layout Subsubsection
2128 Proposed Solution
2129 \end_layout
2130
2131 \begin_layout Standard
2132 We can use various techniques to reduce this for an allocated block:
2133 \end_layout
2134
2135 \begin_layout Enumerate
2136 The 'next' pointer is not required, as we are using a flat hash table.
2137 \end_layout
2138
2139 \begin_layout Enumerate
2140 'rec_len' can instead be expressed as an addition to key_len and data_len
2141  (it accounts for wasted or overallocated length in the record).
2142  Since the record length is always a multiple of 8, we can conveniently
2143  fit it in 32 bits (representing up to 35 bits).
2144 \end_layout
2145
2146 \begin_layout Enumerate
2147 'key_len' and 'data_len' can be reduced.
2148  I'm unwilling to restrict 'data_len' to 32 bits, but instead we can combine
2149  the two into one 64-bit field and using a 5 bit value which indicates at
2150  what bit to divide the two.
2151  Keys are unlikely to scale as fast as data, so I'm assuming a maximum key
2152  size of 32 bits.
2153 \end_layout
2154
2155 \begin_layout Enumerate
2156 'full_hash' is used to avoid a memcmp on the
2157 \begin_inset Quotes eld
2158 \end_inset
2159
2160 miss
2161 \begin_inset Quotes erd
2162 \end_inset
2163
2164  case, but this is diminishing returns after a handful of bits (at 10 bits,
2165  it reduces 99.9% of false memcmp).
2166  As an aside, as the lower bits are already incorporated in the hash table
2167  resolution, the upper bits should be used here.
2168
2169 \change_inserted 0 1283336739
2170  Note that it's not clear that these bits will be a win, given the extra
2171  bits in the hash table itself (see
2172 \begin_inset CommandInset ref
2173 LatexCommand ref
2174 reference "sub:Hash-Size-Solution"
2175
2176 \end_inset
2177
2178 ).
2179 \change_unchanged
2180
2181 \end_layout
2182
2183 \begin_layout Enumerate
2184 'magic' does not need to be enlarged: it currently reflects one of 5 values
2185  (used, free, dead, recovery, and unused_recovery).
2186  It is useful for quick sanity checking however, and should not be eliminated.
2187 \end_layout
2188
2189 \begin_layout Enumerate
2190 'tailer' is only used to coalesce free blocks (so a block to the right can
2191  find the header to check if this block is free).
2192  This can be replaced by a single 'free' bit in the header of the following
2193  block (and the tailer only exists in free blocks).
2194 \begin_inset Foot
2195 status collapsed
2196
2197 \begin_layout Plain Layout
2198 This technique from Thomas Standish.
2199  Data Structure Techniques.
2200  Addison-Wesley, Reading, Massachusetts, 1980.
2201 \end_layout
2202
2203 \end_inset
2204
2205  The current proposed coalescing algorithm doesn't need this, however.
2206 \end_layout
2207
2208 \begin_layout Standard
2209 This produces a 16 byte used header like this:
2210 \end_layout
2211
2212 \begin_layout LyX-Code
2213 struct tdb_used_record {
2214 \end_layout
2215
2216 \begin_layout LyX-Code
2217         uint32_t magic : 16,
2218 \end_layout
2219
2220 \begin_layout LyX-Code
2221                  prev_is_free: 1,
2222 \end_layout
2223
2224 \begin_layout LyX-Code
2225                  key_data_divide: 5,
2226 \end_layout
2227
2228 \begin_layout LyX-Code
2229                  top_hash: 10;
2230 \end_layout
2231
2232 \begin_layout LyX-Code
2233         uint32_t extra_octets;
2234 \end_layout
2235
2236 \begin_layout LyX-Code
2237         uint64_t key_and_data_len;
2238 \end_layout
2239
2240 \begin_layout LyX-Code
2241 };
2242 \end_layout
2243
2244 \begin_layout Standard
2245 And a free record like this:
2246 \end_layout
2247
2248 \begin_layout LyX-Code
2249 struct tdb_free_record {
2250 \end_layout
2251
2252 \begin_layout LyX-Code
2253         uint32_t free_magic;
2254 \end_layout
2255
2256 \begin_layout LyX-Code
2257         uint64_t total_length;
2258 \change_inserted 0 1283337133
2259
2260 \end_layout
2261
2262 \begin_layout LyX-Code
2263
2264 \change_inserted 0 1283337139
2265         uint64_t prev, next;
2266 \change_unchanged
2267
2268 \end_layout
2269
2270 \begin_layout LyX-Code
2271         ...
2272 \end_layout
2273
2274 \begin_layout LyX-Code
2275         uint64_t tailer;
2276 \end_layout
2277
2278 \begin_layout LyX-Code
2279 };
2280 \end_layout
2281
2282 \begin_layout Standard
2283
2284 \change_inserted 0 1283337235
2285 We might want to take some bits from the used record's top_hash (and the
2286  free record which has 32 bits of padding to spare anyway) if we use variable
2287  sized zones.
2288  See
2289 \begin_inset CommandInset ref
2290 LatexCommand ref
2291 reference "freelist-in-zone"
2292
2293 \end_inset
2294
2295 .
2296 \change_unchanged
2297
2298 \end_layout
2299
2300 \begin_layout Subsection
2301 Transaction Commit Requires 4 fdatasync
2302 \end_layout
2303
2304 \begin_layout Standard
2305 The current transaction algorithm is:
2306 \end_layout
2307
2308 \begin_layout Enumerate
2309 write_recovery_data();
2310 \end_layout
2311
2312 \begin_layout Enumerate
2313 sync();
2314 \end_layout
2315
2316 \begin_layout Enumerate
2317 write_recovery_header();
2318 \end_layout
2319
2320 \begin_layout Enumerate
2321 sync();
2322 \end_layout
2323
2324 \begin_layout Enumerate
2325 overwrite_with_new_data();
2326 \end_layout
2327
2328 \begin_layout Enumerate
2329 sync();
2330 \end_layout
2331
2332 \begin_layout Enumerate
2333 remove_recovery_header();
2334 \end_layout
2335
2336 \begin_layout Enumerate
2337 sync();
2338 \end_layout
2339
2340 \begin_layout Standard
2341 On current ext3, each sync flushes all data to disk, so the next 3 syncs
2342  are relatively expensive.
2343  But this could become a performance bottleneck on other filesystems such
2344  as ext4.
2345 \end_layout
2346
2347 \begin_layout Subsubsection
2348 Proposed Solution
2349 \end_layout
2350
2351 \begin_layout Standard
2352 Neil Brown points out that this is overzealous, and only one sync is needed:
2353 \end_layout
2354
2355 \begin_layout Enumerate
2356 Bundle the recovery data, a transaction counter and a strong checksum of
2357  the new data.
2358 \end_layout
2359
2360 \begin_layout Enumerate
2361 Strong checksum that whole bundle.
2362 \end_layout
2363
2364 \begin_layout Enumerate
2365 Store the bundle in the database.
2366 \end_layout
2367
2368 \begin_layout Enumerate
2369 Overwrite the oldest of the two recovery pointers in the header (identified
2370  using the transaction counter) with the offset of this bundle.
2371 \end_layout
2372
2373 \begin_layout Enumerate
2374 sync.
2375 \end_layout
2376
2377 \begin_layout Enumerate
2378 Write the new data to the file.
2379 \end_layout
2380
2381 \begin_layout Standard
2382 Checking for recovery means identifying the latest bundle with a valid checksum
2383  and using the new data checksum to ensure that it has been applied.
2384  This is more expensive than the current check, but need only be done at
2385  open.
2386  For running databases, a separate header field can be used to indicate
2387  a transaction in progress; we need only check for recovery if this is set.
2388 \end_layout
2389
2390 \begin_layout Subsection
2391 \begin_inset CommandInset label
2392 LatexCommand label
2393 name "sub:TDB-Does-Not"
2394
2395 \end_inset
2396
2397 TDB Does Not Have Snapshot Support
2398 \end_layout
2399
2400 \begin_layout Subsubsection
2401 Proposed Solution
2402 \change_deleted 0 1284423472
2403
2404 \end_layout
2405
2406 \begin_layout Standard
2407 None.
2408  At some point you say
2409 \begin_inset Quotes eld
2410 \end_inset
2411
2412 use a real database
2413 \begin_inset Quotes erd
2414 \end_inset
2415
2416
2417 \change_inserted 0 1284423891
2418
2419 \change_deleted 0 1284423891
2420 .
2421
2422 \change_inserted 0 1284423901
2423  (but see
2424 \begin_inset CommandInset ref
2425 LatexCommand ref
2426 reference "replay-attribute"
2427
2428 \end_inset
2429
2430 ).
2431 \change_unchanged
2432
2433 \end_layout
2434
2435 \begin_layout Standard
2436 But as a thought experiment, if we implemented transactions to only overwrite
2437  free entries (this is tricky: there must not be a header in each entry
2438  which indicates whether it is free, but use of presence in metadata elsewhere),
2439  and a pointer to the hash table, we could create an entirely new commit
2440  without destroying existing data.
2441  Then it would be easy to implement snapshots in a similar way.
2442 \end_layout
2443
2444 \begin_layout Standard
2445 This would not allow arbitrary changes to the database, such as tdb_repack
2446  does, and would require more space (since we have to preserve the current
2447  and future entries at once).
2448  If we used hash trees rather than one big hash table, we might only have
2449  to rewrite some sections of the hash, too.
2450 \end_layout
2451
2452 \begin_layout Standard
2453 We could then implement snapshots using a similar method, using multiple
2454  different hash tables/free tables.
2455 \change_inserted 0 1284423495
2456
2457 \end_layout
2458
2459 \begin_layout Subsection
2460 Transactions Cannot Operate in Parallel
2461 \end_layout
2462
2463 \begin_layout Standard
2464 This would be useless for ldb, as it hits the index records with just about
2465  every update.
2466  It would add significant complexity in resolving clashes, and cause the
2467  all transaction callers to write their code to loop in the case where the
2468  transactions spuriously failed.
2469 \end_layout
2470
2471 \begin_layout Subsubsection
2472 Proposed Solution
2473 \end_layout
2474
2475 \begin_layout Standard
2476
2477 \change_inserted 0 1284424201
2478 None (but see
2479 \begin_inset CommandInset ref
2480 LatexCommand ref
2481 reference "replay-attribute"
2482
2483 \end_inset
2484
2485 ).
2486
2487 \change_unchanged
2488 We could solve a small part of the problem by providing read-only transactions.
2489  These would allow one write transaction to begin, but it could not commit
2490  until all r/o transactions are done.
2491  This would require a new RO_TRANSACTION_LOCK, which would be upgraded on
2492  commit.
2493 \end_layout
2494
2495 \begin_layout Subsection
2496 Default Hash Function Is Suboptimal
2497 \end_layout
2498
2499 \begin_layout Standard
2500 The Knuth-inspired multiplicative hash used by tdb is fairly slow (especially
2501  if we expand it to 64 bits), and works best when the hash bucket size is
2502  a prime number (which also means a slow modulus).
2503  In addition, it is highly predictable which could potentially lead to a
2504  Denial of Service attack in some TDB uses.
2505 \end_layout
2506
2507 \begin_layout Subsubsection
2508 Proposed Solution
2509 \end_layout
2510
2511 \begin_layout Standard
2512 The Jenkins lookup3 hash
2513 \begin_inset Foot
2514 status open
2515
2516 \begin_layout Plain Layout
2517 http://burtleburtle.net/bob/c/lookup3.c
2518 \end_layout
2519
2520 \end_inset
2521
2522  is a fast and superbly-mixing hash.
2523  It's used by the Linux kernel and almost everything else.
2524  This has the particular properties that it takes an initial seed, and produces
2525  two 32 bit hash numbers, which we can combine into a 64-bit hash.
2526 \end_layout
2527
2528 \begin_layout Standard
2529 The seed should be created at tdb-creation time from some random source,
2530  and placed in the header.
2531  This is far from foolproof, but adds a little bit of protection against
2532  hash bombing.
2533 \end_layout
2534
2535 \begin_layout Subsection
2536 \begin_inset CommandInset label
2537 LatexCommand label
2538 name "Reliable-Traversal-Adds"
2539
2540 \end_inset
2541
2542 Reliable Traversal Adds Complexity
2543 \end_layout
2544
2545 \begin_layout Standard
2546 We lock a record during traversal iteration, and try to grab that lock in
2547  the delete code.
2548  If that grab on delete fails, we simply mark it deleted and continue onwards;
2549  traversal checks for this condition and does the delete when it moves off
2550  the record.
2551 \end_layout
2552
2553 \begin_layout Standard
2554 If traversal terminates, the dead record may be left indefinitely.
2555 \end_layout
2556
2557 \begin_layout Subsubsection
2558 Proposed Solution
2559 \end_layout
2560
2561 \begin_layout Standard
2562 Remove reliability guarantees; see
2563 \begin_inset CommandInset ref
2564 LatexCommand ref
2565 reference "traverse-Proposed-Solution"
2566
2567 \end_inset
2568
2569 .
2570 \end_layout
2571
2572 \begin_layout Subsection
2573 Fcntl Locking Adds Overhead
2574 \end_layout
2575
2576 \begin_layout Standard
2577 Placing a fcntl lock means a system call, as does removing one.
2578  This is actually one reason why transactions can be faster (everything
2579  is locked once at transaction start).
2580  In the uncontended case, this overhead can theoretically be eliminated.
2581 \end_layout
2582
2583 \begin_layout Subsubsection
2584 Proposed Solution
2585 \end_layout
2586
2587 \begin_layout Standard
2588 None.
2589 \end_layout
2590
2591 \begin_layout Standard
2592 We tried this before with spinlock support, in the early days of TDB, and
2593  it didn't make much difference except in manufactured benchmarks.
2594 \end_layout
2595
2596 \begin_layout Standard
2597 We could use spinlocks (with futex kernel support under Linux), but it means
2598  that we lose automatic cleanup when a process dies with a lock.
2599  There is a method of auto-cleanup under Linux, but it's not supported by
2600  other operating systems.
2601  We could reintroduce a clear-if-first-style lock and sweep for dead futexes
2602  on open, but that wouldn't help the normal case of one concurrent opener
2603  dying.
2604  Increasingly elaborate repair schemes could be considered, but they require
2605  an ABI change (everyone must use them) anyway, so there's no need to do
2606  this at the same time as everything else.
2607 \end_layout
2608
2609 \begin_layout Subsection
2610 Some Transactions Don't Require Durability
2611 \end_layout
2612
2613 \begin_layout Standard
2614 Volker points out that gencache uses a CLEAR_IF_FIRST tdb for normal (fast)
2615  usage, and occasionally empties the results into a transactional TDB.
2616  This kind of usage prioritizes performance over durability: as long as
2617  we are consistent, data can be lost.
2618 \end_layout
2619
2620 \begin_layout Standard
2621 This would be more neatly implemented inside tdb: a
2622 \begin_inset Quotes eld
2623 \end_inset
2624
2625 soft
2626 \begin_inset Quotes erd
2627 \end_inset
2628
2629  transaction commit (ie.
2630  syncless) which meant that data may be reverted on a crash.
2631 \end_layout
2632
2633 \begin_layout Subsubsection
2634 Proposed Solution
2635 \end_layout
2636
2637 \begin_layout Standard
2638 None.
2639 \end_layout
2640
2641 \begin_layout Standard
2642 Unfortunately any transaction scheme which overwrites old data requires
2643  a sync before that overwrite to avoid the possibility of corruption.
2644 \end_layout
2645
2646 \begin_layout Standard
2647 It seems possible to use a scheme similar to that described in
2648 \begin_inset CommandInset ref
2649 LatexCommand ref
2650 reference "sub:TDB-Does-Not"
2651
2652 \end_inset
2653
2654 ,where transactions are committed without overwriting existing data, and
2655  an array of top-level pointers were available in the header.
2656  If the transaction is
2657 \begin_inset Quotes eld
2658 \end_inset
2659
2660 soft
2661 \begin_inset Quotes erd
2662 \end_inset
2663
2664  then we would not need a sync at all: existing processes would pick up
2665  the new hash table and free list and work with that.
2666 \end_layout
2667
2668 \begin_layout Standard
2669 At some later point, a sync would allow recovery of the old data into the
2670  free lists (perhaps when the array of top-level pointers filled).
2671  On crash, tdb_open() would examine the array of top levels, and apply the
2672  transactions until it encountered an invalid checksum.
2673 \change_inserted 0 1284423555
2674
2675 \end_layout
2676
2677 \begin_layout Subsection
2678
2679 \change_inserted 0 1284423617
2680 Tracing Is Fragile, Replay Is External
2681 \end_layout
2682
2683 \begin_layout Standard
2684
2685 \change_inserted 0 1284423719
2686 The current TDB has compile-time-enabled tracing code, but it often breaks
2687  as it is not enabled by default.
2688  In a similar way, the ctdb code has an external wrapper which does replay
2689  tracing so it can coordinate cluster-wide transactions.
2690 \end_layout
2691
2692 \begin_layout Subsubsection
2693
2694 \change_inserted 0 1284423864
2695 Proposed Solution
2696 \begin_inset CommandInset label
2697 LatexCommand label
2698 name "replay-attribute"
2699
2700 \end_inset
2701
2702
2703 \end_layout
2704
2705 \begin_layout Standard
2706
2707 \change_inserted 0 1284423850
2708 Tridge points out that an attribute can be later added to tdb_open (see
2709
2710 \begin_inset CommandInset ref
2711 LatexCommand ref
2712 reference "attributes"
2713
2714 \end_inset
2715
2716 ) to provide replay/trace hooks, which could become the basis for this and
2717  future parallel transactions and snapshot support.
2718 \change_unchanged
2719
2720 \end_layout
2721
2722 \end_body
2723 \end_document