Wednesday, September 23, 2009

Testing for Race Conditions

Concurrent programming is hard for many reason. One of them is that scheduling isn't predictable; the same code may behave differently depending on the OS's scheduling decisions. This is more complicated on single CPU machines when there is no real concurrency, just the illusion of concurrency through time sharing.

Because many operations can happen in a single time slice (which is usually 10ms on most operating systems) unless the race condition falls on the time slice boundary it may go undetected in testing.

My laptop can do several thousand IO operations and many more thread mutex operations per second. Chances are only if one of these has a race condition it'll take very long to find it accidentally.

Since most concurrent code tries to minimize the duration of critical sections, the probability of critical sections interleaving with context switches can be very low. On a single CPU machine all the non blocking operations performed in a single time slice are atomic from the point of view of other processes. For a computer 10ms is a pretty long length of time.

Therefore, when testing for concurrency bugs it's important to introduce artificial contention to increase the chance of exposing a race condition. A fairly reliable method of doing that which I use in Directory::Transactional's test suite is introducing random delays into the process:

use Time::HiRes qw(sleep);
use autodie;

sub cede {
    sleep 0.01 if rand < 0.2;
}

for ( 1 .. $j ) {
    fork and next;

    srand $$;

    for ( 1 .. $n ) {
        # critical section
        cede();
        acquire_resource();
        cede();
        read_resource();
        cede();
        write_resource();
    }

    exit;
}

wait_all();

This is very easy to add only during testing by wrapping the resource access APIs in your codebase.

If you concurrently loop through the critical section long enough the probability of finding a race condition is much higher than without the delays that encourage the OS to perform context switching even on a single CPU machine.

This is by no means an exhaustive method of checking all possible context switch permutations. However, by increasing the probability of interleaved context switches from infinitesimally small to pretty good for a reasonable number of iterations we are much more likely to trigger any race condition (as opposed to all of them).

Furthermore, by designing volatile test transactions the chance of detecting this race condition is also quite good (since an exposed race condition would be more likely to cause inconsistency). Specifically in Directory::Transactional the test simulates money transfers between three accounts, where each iteration deducts a random number from a random account and adds it to another.

The values are read before a delay, and written afterwords, so any other process touching the same account would trigger a race condition if proper locking was not in place. The fact that the values are random too increases the chance of corruption being detectable (it's much more likely the accounts will not be balanced if the value is random as opposed to some constant).

Coupled with concurrent consistency checks on the data done at intervals this was a pretty effective method for detecting race conditions, quickly exposing initial implementation flaws in Directory::Transactional that took very many iterations to reproduce at first. Throwing in random kill -9 signals provided a fun test for journal replay recovery.

3 comments:

Anonymous said...

You might also be interested in the Relacy Race Detector.

http://groups.google.com/group/relacy/web

nothingmuch said...

Cool, that is definitely a more powerful approach.

Oviously wrapping CORE:: functions or APIs with a scheduler of some sort is a good approximation, but ultimately this is the way to go. Too bad Perl's threading is so sucky.

It could probably be ported to Coro more easily, but Coro is less prone to race conditions so it might have little usefulness.

I wonder what the exhastive search times are.

nothingmuch said...

Also, http://blog.woobling.org/2009/05/your-openid-sucks.html