tdb2: remove: it's now in SAMBA where it belongs.
tdb2: add a capability list from the header.

This allows even more extensibility in future: in particular, the top
bits of each capability tell us what to do if we don't understand it:
fail the open, fail to open for write, or don't try to check the
format.

tdb_check needs to understand the capability list so it can know to
skip over it: each element in the list is prefixed with the type tag
and the length.
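A minimal sketch of the behaviour-bit scheme described above. The flag names, bit positions, and struct layout here are illustrative assumptions, not the actual tdb2 on-disk format:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical behaviour bits in the top of each capability type. */
#define CAP_FAIL_OPEN  (1ULL << 63) /* unknown cap: refuse to open at all */
#define CAP_FAIL_WRITE (1ULL << 62) /* unknown cap: refuse to open for write */
#define CAP_SKIP_CHECK (1ULL << 61) /* unknown cap: don't try to check format */

/* Each element is prefixed with its type tag and length, so a reader
 * that doesn't understand a capability can still skip over it. */
struct capability_sketch {
	uint64_t type;   /* tag, with behaviour bits in the top bits */
	uint64_t length; /* total length of this element */
};

/* What should an opener that doesn't understand `type` do? */
static bool may_open(uint64_t type, bool write_access, bool *check_format)
{
	if (type & CAP_FAIL_OPEN)
		return false;
	if ((type & CAP_FAIL_WRITE) && write_access)
		return false;
	if (type & CAP_SKIP_CHECK)
		*check_format = false;
	return true;
}
```

The point of the scheme is that readers never need to understand a capability's contents to make a safe decision about it.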
tdb2: provide tdb_layout_write() rather than implying it by new_tdb_layout arg.

Neater API.
tdb2: merge tdb1_context into tdb_context.

Finally, we split out the tdb2-specific parts of tdb_context and put
them into a "tdb2" sub-struct; the tdb1 parts go into a "tdb1"
sub-struct. We get rid of tdb1_context and use tdb_context everywhere.
tdb2: fix intermittent failure in run-50-multiple-freelists-fail.c

layout.c's TDB creation functions were incorrect in the case of a hash
collision, causing occasional failure. Make it always use the
(previously-failing) seed value, and fix it.
tdb2: use counters to decide when to coalesce records.

This simply uses a 7-bit counter which gets incremented on each
addition to the list (but not decremented on removals). When it wraps,
we walk the entire list looking for things to coalesce. This causes
performance problems, especially when appending records, so we limit
it in the next patch.

Before:
$ time ./growtdb-bench 250000 10 > /dev/null && ls -l /tmp/growtdb.tdb && time ./tdbtorture -s 0 && ls -l torture.tdb && ./speed --transaction 2000000
real	0m59.687s
user	0m11.593s
sys	0m4.100s
-rw------- 1 rusty rusty 752004064 2011-04-27 21:14 /tmp/growtdb.tdb
testing with 3 processes, 5000 loops, seed=0
OK
real	1m17.738s
user	0m0.348s
sys	0m0.580s
-rw------- 1 rusty rusty 663360 2011-04-27 21:15 torture.tdb
Adding 2000000 records:  926 ns (110556088 bytes)
Finding 2000000 records:  592 ns (110556088 bytes)
Missing 2000000 records:  416 ns (110556088 bytes)
Traversing 2000000 records:  422 ns (110556088 bytes)
Deleting 2000000 records:  741 ns (244003768 bytes)
Re-adding 2000000 records:  799 ns (244003768 bytes)
Appending 2000000 records:  1147 ns (295244592 bytes)
Churning 2000000 records:  1827 ns (568411440 bytes)

After:
$ time ./growtdb-bench 250000 10 > /dev/null && ls -l /tmp/growtdb.tdb && time ./tdbtorture -s 0 && ls -l torture.tdb && ./speed --transaction 2000000
real	1m17.022s
user	0m27.206s
sys	0m3.920s
-rw------- 1 rusty rusty 570130576 2011-04-27 21:17 /tmp/growtdb.tdb
testing with 3 processes, 5000 loops, seed=0
OK
real	1m27.355s
user	0m0.296s
sys	0m0.516s
-rw------- 1 rusty rusty 617352 2011-04-27 21:18 torture.tdb
Adding 2000000 records:  890 ns (110556088 bytes)
Finding 2000000 records:  565 ns (110556088 bytes)
Missing 2000000 records:  390 ns (110556088 bytes)
Traversing 2000000 records:  410 ns (110556088 bytes)
Deleting 2000000 records:  8623 ns (244003768 bytes)
Re-adding 2000000 records:  7089 ns (244003768 bytes)
Appending 2000000 records:  33708 ns (244003768 bytes)
Churning 2000000 records:  2029 ns (268404160 bytes)
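The counter mechanism in this commit can be sketched as follows; the names here are illustrative, not the actual tdb2 code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define COALESCE_COUNTER_BITS 7
#define COALESCE_COUNTER_MASK ((1 << COALESCE_COUNTER_BITS) - 1) /* 127 */

/* Bump the per-list counter on each addition (never on removal);
 * returns true when the 7-bit counter wraps, i.e. when the caller
 * should walk the whole list looking for records to coalesce. */
static bool bump_coalesce_counter(uint8_t *counter)
{
	*counter = (*counter + 1) & COALESCE_COUNTER_MASK;
	return *counter == 0;
}
```

So a full coalesce pass happens once per 128 additions to a given list, which is cheap on average but, as the benchmark shows, hurts append-heavy workloads until the follow-up patch limits the walk.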
tdb2: don't start again when we coalesce a record.

We currently start walking the free list again when we coalesce any
record; this is overzealous, as we only care about the next record
being blatted, or the record we currently consider "best". We can also
opportunistically try to add the coalesced record into the new free
list: if it fails, we go back to the old "mark record, unlock,
re-lock" code.

Before:
$ time ./growtdb-bench 250000 10 > /dev/null && ls -l /tmp/growtdb.tdb && time ./tdbtorture -s 0 && ls -l torture.tdb && ./speed --transaction 2000000
real	1m0.243s
user	0m13.677s
sys	0m4.336s
-rw------- 1 rusty rusty 683302864 2011-04-27 21:03 /tmp/growtdb.tdb
testing with 3 processes, 5000 loops, seed=0
OK
real	1m24.074s
user	0m0.344s
sys	0m0.468s
-rw------- 1 rusty rusty 836040 2011-04-27 21:04 torture.tdb
Adding 2000000 records:  1015 ns (110551992 bytes)
Finding 2000000 records:  641 ns (110551992 bytes)
Missing 2000000 records:  445 ns (110551992 bytes)
Traversing 2000000 records:  439 ns (110551992 bytes)
Deleting 2000000 records:  807 ns (199517112 bytes)
Re-adding 2000000 records:  851 ns (199517112 bytes)
Appending 2000000 records:  1301 ns (376542552 bytes)
Churning 2000000 records:  2423 ns (553641304 bytes)

After:
$ time ./growtdb-bench 250000 10 > /dev/null && ls -l /tmp/growtdb.tdb && time ./tdbtorture -s 0 && ls -l torture.tdb && ./speed --transaction 2000000
real	0m57.403s
user	0m11.361s
sys	0m4.056s
-rw------- 1 rusty rusty 689536976 2011-04-27 21:10 /tmp/growtdb.tdb
testing with 3 processes, 5000 loops, seed=0
OK
real	1m24.901s
user	0m0.380s
sys	0m0.512s
-rw------- 1 rusty rusty 655368 2011-04-27 21:12 torture.tdb
Adding 2000000 records:  941 ns (110551992 bytes)
Finding 2000000 records:  603 ns (110551992 bytes)
Missing 2000000 records:  428 ns (110551992 bytes)
Traversing 2000000 records:  416 ns (110551992 bytes)
Deleting 2000000 records:  741 ns (199517112 bytes)
Re-adding 2000000 records:  819 ns (199517112 bytes)
Appending 2000000 records:  1228 ns (376542552 bytes)
Churning 2000000 records:  2042 ns (553641304 bytes)
tdb2: move mmap into struct tdb_file

It makes sense to share the mmap between multiple openers.
tdb2: make sure records with extra padding have a 0 byte.

As detailed in doc/design.lyx section 2.16 "Record Headers Are Not
Expandible", we make sure that if there is padding at the end of a
record, the first byte of padding is a zero.
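The rule above can be sketched as a helper; the function name and signature here are hypothetical, not the actual tdb2 code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Write key+data into an allocated record; if the allocation is
 * larger than what we used, the first byte of padding must be zero. */
static void write_record_sketch(char *rec, uint64_t used_len,
				uint64_t alloc_len, const char *key_and_data)
{
	memcpy(rec, key_and_data, used_len);
	if (alloc_len > used_len)
		rec[used_len] = '\0'; /* first padding byte is always zero */
}
```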
tdb2: avoid writing uninitialized bytes in test/layout.c
tdb2: fix leak in tests.
tdb2: use separate magic constants for chain, htable and ftable entries

Rather than overloading TDB_USED_MAGIC and the hash value as we do
now. We also rename "free list" to the more-accurate "free table"
everywhere.
tdb2: shrink free header from 32 to 24 bytes.

This reduces our minimum key+data length to 8 bytes; we do this by
packing the prev pointer where we used to put the flist pointer, and
storing the flist as an 8-bit index (meaning we can only have 256 free
tables).

Note that this has a perverse result on the size of the database, as
our 4-byte key and 4-byte data now fit perfectly in a minimal record,
so appending causes us to allocate new records which are 50% larger,
since we detect growing.

Current results of speed test:
$ ./speed 1000000
Adding 1000000 records:  23210 ns (59193360 bytes)
Finding 1000000 records:  2387 ns (59193360 bytes)
Traversing 1000000 records:  2150 ns (59193360 bytes)
Deleting 1000000 records:  13392 ns (59193360 bytes)
Re-adding 1000000 records:  11546 ns (59193360 bytes)
Appending 1000000 records:  29327 ns (91193360 bytes)
Churning 1000000 records:  33026 ns (91193360 bytes)

Previous:
$ ./speed 1000000
Adding 1000000 records:  28324 ns (67232528 bytes)
Finding 1000000 records:  2468 ns (67232528 bytes)
Traversing 1000000 records:  2200 ns (67232528 bytes)
Deleting 1000000 records:  13083 ns (67232528 bytes)
Re-adding 1000000 records:  16433 ns (67232528 bytes)
Appending 1000000 records:  2511 ns (67232528 bytes)
Churning 1000000 records:  31068 ns (67570448 bytes)
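The packing idea can be illustrated with a sketch; the field names, bit positions, and exact split between index and length are assumptions for illustration, not the actual tdb2 layout:

```c
#include <assert.h>
#include <stdint.h>

/* A 24-byte free-record header: three 64-bit words. The free-table
 * identifier is an 8-bit index packed into the top of one word, which
 * is what limits us to 256 free tables. */
struct free_record_sketch {
	uint64_t magic_and_prev; /* magic bits packed with the prev pointer */
	uint64_t ftable_and_len; /* 8-bit free-table index + record length */
	uint64_t next;           /* next free record in this table's list */
};

#define FTABLE_SHIFT 56

static uint8_t frec_ftable(const struct free_record_sketch *f)
{
	return f->ftable_and_len >> FTABLE_SHIFT;
}

static uint64_t frec_len(const struct free_record_sketch *f)
{
	return f->ftable_and_len & ((1ULL << FTABLE_SHIFT) - 1);
}
```

Dropping one 64-bit word from the 32-byte header is exactly what lets the minimum key+data length shrink to 8 bytes.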
tdb2: rename set_header to the more appropriate set_used_header.
tdb2: handle chains of free tables

This adds chains of free tables: we choose one at random at the start
and we iterate through them when they are exhausted. Currently there
is no code to actually add a new free table, but we test that we can
handle it if we add one in future.
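The iteration strategy above amounts to a circular walk over the chain; this sketch uses illustrative names, not the actual tdb2 structures:

```c
#include <assert.h>
#include <stddef.h>

/* A free table in a singly-linked chain. */
struct ftable_sketch {
	struct ftable_sketch *next;
};

/* When the current table is exhausted, move to the next one in the
 * chain, wrapping back to the head after the last table. */
static struct ftable_sketch *next_ftable(struct ftable_sketch *head,
					 struct ftable_sketch *cur)
{
	return cur->next ? cur->next : head;
}
```

Starting each opener at a random table spreads contention across the chain, while the wrap-around guarantees every table is eventually tried.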
tdb2: get rid of zones

Zones were a bad idea. They mean we can't simply add stuff to the end
of the file (which transactions relied upon), and there's no good
heuristic in how to size them. This patch reverts us to a single free
table.
tdb2: remove tailer

We don't actually need it.
tdb2: fix gcc -O3 warnings on test/layout.c

Warnings about ignored returns, and uninitialized len after case
statement.
tdb2: extend test/layout to be able to place in file.

This was for lockcheck, but that didn't work very well. Seems like a
useful addition nonetheless.
tdb2: change to using a hash tree.

As the locking issues with enlarging a hash were so nasty, we switch
to a tree structure for the entries. It's a hash which expands to
point to sub-hashes when it fills.

This means we no longer have a 'volatile' header: the top hash cannot
move. In fact, we no longer store a copy of the header in the
tdb_context; we only need hash_seed.

New helper functions for accessing writable areas and committing the
results (if they had to be copied). New debug test to make sure we
don't hold access while we're doing something which could cause us to
unmap/remap.

Find becomes more complicated: we need to track where we found (or
didn't find) an entry so we can easily add/delete it. Traverse becomes
more complicated: we need to track where we were in the hash tree.
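The core of such a hash tree is consuming successive groups of hash bits as you descend; this sketch's group size and names are illustrative assumptions, not the actual tdb2 format:

```c
#include <assert.h>
#include <stdint.h>

#define GROUP_BITS 4 /* 16 buckets per level in this sketch */

/* Extract the bucket index used at tree depth `level`: level 0 uses
 * the lowest GROUP_BITS of the hash, level 1 the next group, etc.
 * When a bucket fills and is expanded into a sub-hash, entries are
 * redistributed there using the next group of bits. */
static uint32_t bucket_at_level(uint64_t hash, unsigned level)
{
	return (hash >> (level * GROUP_BITS)) & ((1u << GROUP_BITS) - 1);
}
```

Because only deeper levels ever grow, the top hash never moves, which is what removes the 'volatile' header problem the commit describes.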