Rusty Russell [Fri, 9 Apr 2010 01:24:31 +0000 (10:54 +0930)]
From: Joseph Adams <joeyadams3.14159@gmail.com>
The btree patch gives the btree module an intuitive frontend
(btree_insert, btree_remove, btree_lookup) and a built-in ordering
function for strings. Together, these make it easy to use the btree
module as a dynamic string map.
Rusty Russell [Wed, 24 Feb 2010 03:38:40 +0000 (14:08 +1030)]
tdb: handle processes dying during transaction commit.
tdb transactions were designed to be robust against the machine
powering off, but interestingly were never designed to handle the case
where an administrator kill -9's a process during commit. Because
recovery is only done on tdb_open, processes with the tdb already
mapped will simply use it despite it being corrupt and needing
recovery.
The solution to this is to check for recovery every time we grab a
data lock: we could have gained the lock because a process just died.
This has no measurable cost: here is the time for tdbtorture -s 0 -n 1
-l 10000:
Rusty Russell [Wed, 24 Feb 2010 03:24:31 +0000 (13:54 +1030)]
tdb: remove lock ops
Now the transaction code uses the standard allrecord lock, that stops
us from trying to grab any per-record locks anyway. We don't need to
have special noop lock ops for transactions.
This is a nice simplification: if you see brlock, you know it's really
going to grab a lock.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Tue, 23 Feb 2010 22:22:44 +0000 (08:52 +1030)]
tdb: suppress record write locks when allrecord lock is taken.
Records themselves get (read) locked by the traversal code against delete.
Interestingly, this locking isn't done when the allrecord lock has been
taken, though the allrecord lock until recently didn't cover the actual
records (it now goes to end of file).
The write record lock, grabbed by the delete code, is not suppressed by
the allrecord lock, which causes us to punch a hole in that lock when we
release the write record lock. Make this consistent: *no* record locks
of any kind when the allrecord lock is taken.
Rusty Russell [Mon, 22 Feb 2010 12:33:36 +0000 (23:03 +1030)]
tdb: don't reduce file size on transaction recovery.
There's little point in ever shrinking the file, and it definitely breaks in the case where a process has died during a transaction commit and other processes have the tdb mapped.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If a process (or the machine) dies after just after writing the
recovery head (pointing at the end of file), the recovery record will filled
with 0x42. This will not invoke a recovery on open, since rec.magic
!= TDB_RECOVERY_MAGIC.
Unfortunately, the first transaction commit will happily reuse that
area: tdb_recovery_allocate() doesn't check the magic. The recovery
record has length 0x42424242, and it writes that back into the
now-valid-looking transaction header) for the next comer (which
happens to be tdb_wipe_all in my tests).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Mon, 22 Feb 2010 04:33:23 +0000 (15:03 +1030)]
tdb: cleanup: remove ltype argument from _tdb_transaction_cancel.
Now the transaction allrecord lock the standard one, and thus is cleaned
in tdb_release_extra_locks(), _tdb_transaction_cancel() doesn't need to
know what type it is.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Centralize locking of all chains of the tdb; rename _tdb_lockall to
tdb_allrecord_lock and _tdb_unlockall to tdb_allrecord_unlock, and
tdb_brlock_upgrade to tdb_allrecord_upgrade.
Then we use this in the transaction code. Unfortunately, if the transaction
code records that it has grabbed the allrecord lock read-only, write locks
will fail, so we treat this upgradable lock as a write lock, and mark it
as upgradable using the otherwise-unused offset field.
One subtlety: now the transaction code is using the allrecord_lock, the
tdb_release_extra_locks() function drops it for us, so we no longer need
to do it manually in _tdb_transaction_cancel.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Mon, 22 Feb 2010 03:57:13 +0000 (14:27 +1030)]
tdb: cleanup: always grab allrecord lock to infinity.
We were previously inconsistent with our "global" lock: the
transaction code grabbed it from FREELIST_TOP to end of file, and the
rest of the code grabbed it from FREELIST_TOP to end of the hash
chains. Change it to always grab to end of file for simplicity.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Mon, 22 Feb 2010 03:36:02 +0000 (14:06 +1030)]
tdb: use tdb_nest_lock() for open lock.
This never nests, so it's overkill, but it centralizes the locking into
lock.c and removes the ugly flag in the transaction code to track whether
we have the lock or not.
Note that we have a temporary hack so this places a real lock, despite
the fact that we are in a transaction.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Mon, 22 Feb 2010 03:03:53 +0000 (13:33 +1030)]
tdb: don't suppress the transaction lock because of the allrecord lock.
tdb_transaction_lock() and tdb_transaction_unlock() do nothing if we
hold the allrecord lock. However, the two locks don't overlap, so
this is wrong.
This simplification makes the transaction lock a straight-forward nested
lock.
There are two callers for these functions:
1) The transaction code, which already makes sure the allrecord_lock
isn't held.
2) The traverse code, which wants to stop transactions whether it has the
allrecord lock or not. There have been deadlocks here before, however
this should not bring them back (I hope!)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Mon, 22 Feb 2010 03:03:27 +0000 (13:33 +1030)]
tdb: cleanup: tdb_nest_lock/tdb_nest_unlock
Because fcntl locks don't nest, we track them in the tdb->lockrecs array
and only place/release them when the count goes to 1/0. We only do this
for record locks, so we simply place the list number (or -1 for the free
list) in the structure.
To generalize this:
1) Put the offset rather than list number in struct tdb_lock_type.
2) Rename _tdb_lock() to tdb_nest_lock, make it non-static and move the
allrecord check out to the callers (except the mark case which doesn't
care).
3) Rename _tdb_unlock() to tdb_nest_unlock(), make it non-static and
move the allrecord out to the callers (except mark again).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Mon, 22 Feb 2010 02:59:36 +0000 (13:29 +1030)]
tdb: cleanup: rename GLOBAL_LOCK to OPEN_LOCK.
The word global is overloaded in tdb. The GLOBAL_LOCK offset is used at
open time to serialize initialization (and by the transaction code to block
open).
Rename it to OPEN_LOCK.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
With the ctdb checkin dde9f3f006 tdb optimized out write lock checks for
write-enabled transaction. Sadly, this also removed the possibility to ever
remove dead records left over from tdb_delete calls within a transaction.
Tridge, please check this! Did dde9f3f006 have any reason beyond performance
optimizations?
(ending the transaction-"mutex") was done before the
/* remove the recovery marker */
This means that when a transaction is committed there is a window where another
opener of the file sees the transaction marker while the transaction committer
is still fully functional and working on it. This led to transaction being
rolled back by that second opener of the file while transaction_commit() gave
no error to the caller.
This patch moves the F_UNLCK to after the recovery marker was removed, closing
this window.
Rusty Russell [Fri, 25 Sep 2009 05:03:08 +0000 (14:33 +0930)]
ccanlint: compile and run tests.
This means we should skip building if there are no C files in module: running tests requires building the module, but not necessarily that it has any C files.
Rusty Russell [Fri, 25 Sep 2009 03:16:08 +0000 (12:46 +0930)]
tdb: allow reads after prepare commit
(Imported from SAMBA commit 46c99ec2a3781d53138245182345c6c2ddd9a258:
From: Andrew Tridgell <tridge@samba.org>
Date: Tue, 15 Sep 2009 14:04:22 -0700
We previously only allowed a commit to happen after a prepare
commit. It is in fact safe to allow reads between a prepare and a
commit, and the s4 replication code can make use of that, so allow it.
Rusty Russell [Sat, 12 Sep 2009 01:27:44 +0000 (10:57 +0930)]
Build tests for ccan.
More sophisticated skipping: skip dependencies when one fails as well.
Allow tests to change their total_score; only access it after running (other than to check it's non-zero).