Monday, August 17, 2009

Gospel of Consistency

MySQL is configured for data loss out of the box[1]. Putting stuff "in the cloud" won't make it magically scalable; without understanding and overcoming the constraints imposed by cloud computing, your app will likely not work at all. When CouchDB 0.9 removed multi-document transactions I was confused about the technical reasons, but much of the feedback I received seemed to misunderstand my point.

It seems that consistency, especially in a distributed environment, is a thoroughly misunderstood concept.

Consistent data is data that makes sense. Much like you can't just pull correct data out of your ass, you can't put inconsistent data back in there and "eventually" pull out something consistent. Corrupt data may not be immediately obvious, which is why consistency requires methodology.

Taking a guess at the reasons behind MySQL's utterly broken default, I suspect that in this case it's not even an optimization. To a person who does not understand concurrency, silent data loss is not immediately apparent, but having your transaction fail to commit is very scary if you don't understand what's going on.

The Meaning of Inconsistency

If you want to have your car fixed, you take it to a garage. A transaction will take place, which involves you paying money and the mechanic fixing your car.

An inconsistent state would be either you paying money and the problem not being fixed, or the mechanic fixing your car and not getting compensated. A rolled back transaction puts you back where you started, with a broken car and some money to fix it.

Suppose you went to a mechanic, the mechanic took your money, and then discovered that a necessary part is not in stock; that's a simple error to resolve. This is an expected failure.

If on the other hand the part installation requires a specialized tool she might not have, this error is a little more complicated, but can be easily detected in advance.

Lastly, if a dinosaur attacks the garage and eats your car, that's an even more complicated failure mode. Thankfully dinosaur attacks aren't very common.

Here is a coarse-grained transactional protocol for paying money to get a car fixed:

  1. Client arrives with cash, a safe (that has no value) and a broken car
  2. Mechanic estimates a quote
  3. Client places cash in the safe, retaining the key (if there is insufficient cash the transaction is rolled back)
  4. Mechanic keeps the safe in the garage and proceeds to work, as the client waits in the lobby. If some error occurs (e.g. the car turns out to be unfixable), the transaction is rolled back
  5. Client inspects work
  6. Client gives key to mechanic, who extracts the cash

Note that this protocol doesn't protect against dinosaur errors (much like enabling fsync in your database doesn't protect against disk failure), but does protect against the other failure modes by ensuring there are sufficient funds, and that the mechanic does not get access to payment until the job is completed successfully.
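As a toy sketch, the escrow protocol above could be written down like this (all the names are made up for illustration; the point is the rollback structure, not the code):

```python
class RolledBack(Exception):
    """The transaction is undone: client keeps the cash, car stays broken."""
    pass

def repair_transaction(cash, quote, fix_car, inspect):
    if cash < quote:                       # step 3: insufficient cash, roll back
        raise RolledBack("insufficient funds")
    escrow = cash                          # cash goes in the safe, client keeps the key
    try:
        fix_car()                          # step 4: the actual work
    except Exception as error:
        raise RolledBack(str(error))       # e.g. the car turns out to be unfixable
    if not inspect():                      # step 5: client inspects the work
        raise RolledBack("work rejected")
    return escrow                          # step 6: key handed over, mechanic is paid
```

The mechanic only ever sees the money after the last step; every earlier failure leaves both parties where they started.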

The point of this example is to demonstrate how cumbersome full ACID can be for a process that is generally expected to go well. The cost of lugging around a safe in order to ensure your mechanic doesn't screw you over is unreasonable. If you paid and didn't get your money's worth, you use a conflict resolution tool, such as a lawyer.

If every trip to the supermarket would require dragging a safe to escrow your money the world economy would grind to a halt.

Having a scalable system means balancing out the risk and cost of failures (expected and unexpected) vs. the cost of protecting against them. The garage would probably incur a small penalty for a delay (for instance giving you a small discount). They would also probably pay insurance against unlikely but foreseeable events, like a fire. Investing in a laser defense system is not that sensible.

What it boils down to is that when you hand your cash over, you want to be reasonably sure that you're going to get what you paid for. If your mechanic says "sorry, I can't work on this type of car" that's a conflict that has been detected up front, but if she says "it might take me a few more days because I need to order a part" you can get back to your life in the meantime.

In the real world we often use optimistic concurrency to reduce the overhead for the most likely outcome.

Eventual Consistency In Software

The way the term "Eventual Consistency" is tossed around these days reminds me of that line from The Princess Bride: "You keep using that word. I do not think it means what you think it means."

An efficient garage runs under an eventual consistency model. We employ various tactics to reach consistency in the end if there are surprises, but generally there won't be any. This works well because unlike software, humans are very adaptable to failure.

The key to implementing an eventually consistent system is that when you make a promise to do something (e.g. the cash handover) you should be damn sure you can finish the job, even if there are a few complications (e.g. ordering additional parts). Allowing for real failures is not a good option, because breaking a promise is much more complicated than simply not making one in the first place.

Note that you can fix inconsistencies after the fact (this is called a compensating action), but predicting the failure modes, testing for them, and fitting auditing for these repairs into the schema is usually a lot more complicated than making sure you wouldn't need them in the first place.

In traditional multithreaded programming you need to use mutexes to protect a critical section. You could use a single mutex to protect a large sequence (a combination of several steps), or you could use finer grained locking to protect each sub step separately. You still need locking to avoid race conditions for each critical sub-step.

For instance, suppose you're writing an OS kernel and you need to allocate a page of memory for a process. If your kernel is threaded you need to ensure that the updates to the free list and the allocated list have no race conditions (or you might corrupt the memory maps because of two simultaneous allocations). The easiest way to do that is to use a mutex for the duration of the update.
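A minimal sketch of that mutex approach, in Python rather than kernel C (the two lists stand in for real memory maps):

```python
import threading

# The free list and the allocated list stand in for the kernel's memory maps.
# The mutex makes moving a page between them atomic, so two simultaneous
# allocations can never hand out the same page.
free_pages = list(range(16))
allocated = []
maps_lock = threading.Lock()

def alloc_page():
    with maps_lock:              # critical section: both lists change together
        page = free_pages.pop()
        allocated.append(page)
        return page
```

Without the lock, two threads could both pop the same page or corrupt the lists mid-update.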

Next, consider the counters used by commands like top to report information about system or process memory. These are not critical resources. Nothing really depends on them providing the right value, so they could be momentarily inconsistent with the actual memory maps. Basically, there is no real benefit in keeping the critical memory map mutex until you've updated all the counters; doing so slows down memory access for the program requesting the page, and for any other process needing to make an allocation.

The eventual consistency way to allocate memory would be for the kernel thread that allocates the memory page to simply ignore the counters. It would need to safely update the memory map and then make a note of the change in some queue. This way the mutex is held for a short duration.

Later, a low priority thread would aggregate the various updates from the queue, and bring the counters up to date, making them consistent with the memory maps.
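Sketching that deferred scheme (again in Python, with a plain function standing in for the low priority aggregator thread):

```python
import queue
import threading

free_pages = list(range(16))
maps_lock = threading.Lock()

pending = queue.Queue()                    # change log for deferred counter updates
counters = {"free": 16, "allocated": 0}
counters_lock = threading.Lock()           # the counters still need their own mutex

def alloc_page():
    with maps_lock:                        # short critical section: just the map
        page = free_pages.pop()
    pending.put(1)                         # note the change; counters updated later
    return page

def aggregate_counters():
    # In a real kernel this would run on a low priority thread; it's called
    # directly here for simplicity.
    while not pending.empty():
        pages = pending.get()
        with counters_lock:
            counters["free"] -= pages
            counters["allocated"] += pages
```

Between an allocation and the next aggregation pass the counters are stale, but the memory map itself is never inconsistent.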

This does not mean that the update to the memory map can be done without a mutex. The eventual consistency fairy will not come and magically fix the memory corruption in the event of a race condition. You also still need to use a mutex to protect the counters and the queue.

BASE and ACID are not really opposites, despite the pun in the name. BASE is an alternative to using ACID for everything, but it doesn't completely replace ACID, it builds on it as a low level tool.

Eventual consistency is about avoiding database transactions that take unreasonably long or require unreasonable complexity (multi node synchronization) by splitting them up into smaller localized steps. This also allows you to return results to the client before all the steps have been finished.

This means that the client might send the next request before all the steps have been finished, and get a result that is out of date, but this inconsistency must be fixed eventually. If it isn't, that's a bug (eventual consistency still implies consistency; without it the consistency model is a weak one).

If the client tries to make a conflicting change, the steps must be carried out in the correct order, and on the correct shard (the one that is authoritative for the data being changed), so that conflicts can never happen in any transactional step except the first one. By partitioning the data wisely you can be sure that such changes can always be made on a single master, and thus you can avoid the need for distributed transactions.
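One hypothetical way to localize authority like that is deterministic key-to-shard routing, roughly sketched as (the node names are made up):

```python
import hashlib

SHARDS = ["node-a", "node-b", "node-c"]    # hypothetical node names

def authoritative_shard(key):
    # Every key deterministically maps to exactly one shard. All writes for
    # that key are sent there and serialized by that single master, so no
    # distributed transaction is needed for changes to a single key.
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return SHARDS[digest[0] % len(SHARDS)]
```

The partitioning scheme matters: co-dependent data should hash to the same shard so each transactional step stays local.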

ACID transactions are like scope of a mutex protected operation. You still need some ACID guarantees to implement BASE correctly, but BASE methodologies are careful to avoid unnecessary contention and complexity by keeping the transactions small and localized.

Compromising on consistency this way means that some database reads could temporarily return stale data. To properly deal with that, avoid using stale data as a basis for future writes, or make sure your infrastructure won't let you do that by refusing to commit your transactions. Either the read, or both the read and the write, of an update needs to go through the single authoritative copy of that data.
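A tiny illustration of refusing to commit writes based on stale reads, using a version check against a single authoritative copy (an in-memory dict stands in for the database):

```python
class Conflict(Exception):
    """Raised when a write is based on a stale read."""
    pass

store = {"x": (1, "a")}          # key -> (version, value); the authoritative copy

def read(key):
    return store[key]            # the version travels along with the value

def write(key, expected_version, new_value):
    version, _ = store[key]
    if version != expected_version:          # the read was stale: refuse to commit
        raise Conflict(key)
    store[key] = (version + 1, new_value)
```

A writer holding an outdated version gets a conflict instead of silently clobbering a newer value, which is exactly the guarantee the MySQL default fails to give.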

Eventual consistency is not simpler than ACID, at least not for the programmer. It lets you build a scalable architecture from simpler primitives. If databases could work out what parts of the transactions can lag then BASE could be automated, but they can't, so it's the responsibility of the implementors to draw the lines in the right places.

In the kernel example, the naive threading approach was replaced by a more complex event driven model (which is still multithreaded). Event oriented programming may require managing more components, but forcing an ill-suited paradigm to scale is a much harder problem to solve.

Lastly, keep preemptive error detection in mind. Your system needs to deal with failure, and detecting it ahead of time is the easiest way to do that. In the real world we are much more adaptable and can assess the situation to fix these problems as they arise, but software is made of fail, so it should stay simple and predictable. If you need to implement complex compensating actions you might as well go with a truly multi-master infrastructure.

CouchDB 0.9

A while ago I wrote that CouchDB doesn't fit my needs well, based on my experience with version 0.8. What I didn't know is that CouchDB 0.9 had been out for a while, and while this version fixed some of my concerns, it also took away the one feature I really liked: the beautifully lightweight transactions.

In version 0.8, whenever you did a bulk update/insert of several documents at the same time, commits would be atomic. This means that if any single document had a conflict, the whole operation would fail. In version 0.9, only the conflicting document fails, while the rest of the batch is committed.

What this means is that the atomicity guarantee can no longer be provided for sets of documents; if your documents are co-dependent you might end up with inconsistencies in your database. Transactions are scoped to a single document only.

While some of my past criticism has been resolved and other parts still stand, I think this change is much more crucial when deciding whether or not to use CouchDB.

The rationale behind the removal is that it's too difficult to implement multi document transactions in a way that supports replication and transparent sharding, so to make sure that projects that start development on a single CouchDB server can scale to a cluster architecture, they need to deal with the transactional limitations early on.

Transparent Sharding

In my aforementioned email I asked why an explicit opt-in atomicity flag couldn't be introduced. At least in principle the implementation of such a flag is not very complicated. The problem is that in order for this to be useful, the sharding needs to be implemented such that the set of documents lives on a single node.

One of the design goals of CouchDB is that clustering should embed a minimal amount of knowledge about how the cluster is administered into the client. In short, the application code is supposed to be able to stay oblivious to the sharding, and therefore scale transparently if necessary.

The explicit removal of this feature implies that the "correct" sharding model is that of a distributed hash table. Even with consistent hashing the selection of the node responsible for storing a set of documents is supposed to be opaque.

In light of this change, to properly implement eventual consistency on top of CouchDB you need to update your documents in the correct order. If you update a document and get a conflict, all related updates should be aborted. However, if that first update succeeds you need to be sure that all related updates can be carried out successfully. More specifically, any related update that fails due to a conflict must be simple enough to retry with the updated version.

The ACID guarantees you get from CouchDB only extend to a single document, so that is the coarsest granularity you can use. To implement a BASE like change workflow with CouchDB you need to make sure that every data update operation that affects multiple documents will only conflict in the first document that you update, or be prepared to implement compensating actions to resolve conflicting versions of documents after the fact.
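A rough sketch of that discipline, with an in-memory dict standing in for the document store (the names and the revision check are simplified assumptions, not CouchDB's actual API):

```python
class Conflict(Exception):
    pass

db = {"account": (1, {"balance": 0})}   # doc id -> (rev, body)

def update_doc(doc_id, rev, body):
    current_rev, _ = db.get(doc_id, (0, None))
    if current_rev != rev:               # optimistic check, one document at a time
        raise Conflict(doc_id)
    db[doc_id] = (current_rev + 1, body)

def deposit(amount):
    rev, account = db["account"]
    # Step 1: the only update that can conflict; if it does, abort everything
    # and retry the whole operation with the fresh revision.
    update_doc("account", rev, {"balance": account["balance"] + amount})
    # Step 2: a brand new document, so its creation cannot conflict.
    update_doc("log-%d" % len(db), 0, {"amount": amount})
```

All contention is funneled into the first update; everything after it is a fresh document that nobody else can be writing to.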

Secondly, once you've issued that first crucial update, you also need to be sure that all the related updates are still run. If you already returned a result to the client and these actions run in batch, this can get tricky. BASE requires reliable message handling as well, using a transactional message queue.

That last bit is what troubles me most about CouchDB. Designing a data model whereby updates must be queued transactionally but transactions are limited to one document is rather tricky.

Reducing Contention

The CouchDB savvy way to implement deferred updates is actually to avoid issuing updates at all. The view subsystem lets you aggregate data and do simple inter-document data munging. CouchDB ensures that views will be synchronized to the state of the database, and the purely functional approach prevents conflicts. Hopefully the view system will be generalized to support multiple levels of mapping and views derived from other views.

I've already written about the benefits of append only models for scaling (and see also first and third posts on that topic).

If most data operations involve the creation of new documents (a process that shouldn't fail) followed by simple updates after those have succeeded, then you can make better use of the ACID guarantees CouchDB provides. Unsurprisingly, this is precisely the approach the book recommends.

If views are not enough for you, you can obviously use some sort of message queue, but the act of dequeuing a message and applying its effect should be protected by a single transaction. If a deferred update is not executed or executed more than once, the data will end up inconsistent (unless the update is idempotent)[2].
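The dequeue-and-apply constraint can be sketched like this (a plain dict stands in for the document; recording applied update IDs makes redelivery idempotent):

```python
# The document tracks which updates it has already seen, so dequeuing a
# message twice (or crashing between apply and acknowledge) cannot double
# its effect.
doc = {"total": 0, "applied": set()}

def apply_update(update_id, amount):
    if update_id in doc["applied"]:    # redelivery: already applied, do nothing
        return
    # In a real document store these two changes would be committed together
    # as one single-document write.
    doc["total"] += amount
    doc["applied"].add(update_id)
```

Because the effect and the applied-ID bookkeeping live in the same document, single-document atomicity is enough to keep them consistent.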

If you reduce all contention down to a single document you can effectively use CouchDB without sacrificing the consistency of your data.


I still don't use CouchDB for work. My data models tend to be fairly interlinked, so the cost of working around the lack of multi-document transactions outweighs the potential benefits of using CouchDB. It's just not a good fit for writing KiokuDB code that works well with other backends. That doesn't mean you shouldn't use CouchDB, but it means that if your data is heavily interlinked and you introduce concurrency, your data might get corrupted.

If you look past the hype, CouchDB doesn't relieve you of thinking about how to design your schema. Of course, there is nothing wrong with that. Like any tool, you still need to know what it's doing to use it correctly.

Secondly, back in Perl land I write all my transactional code in such a way that failure is an option. Either transactions can fail to commit, or you need to use locks, or some data will eventually get overwritten. Pick any one, and choose or configure your database accordingly.

Lastly, I don't expect scalability to magically happen on its own. There is no secret formula that lets you wish the difficulties away. With good habits you can make the transition happen more easily, but I think it's still vital to understand how everything works under the hood, so that you know if it's viable, and so that you can overcome the limitations successfully.

Framing this in terms of Werner Vogels' often linked post, you need to design a schema that will work with the consistency guarantees that your infrastructure provides (e.g. causal consistency, monotonic write consistency, etc.)[3].

See Also

Distributed systems are a rich topic. What it always boils down to is achieving an approximation of the global ordering of events. Distributed systems exist because we simply can't scale to handle everything serially on a single node, which is also a single point of failure. If the global ordering is total then replication adds redundancy, but reduces scalability.

The compromise always revolves around finding a partial ordering for your operations that still produces consistent data, so that even if there are variations in the order of events as observed by the distributed parts, the operations that do depend on each other are always handled in sequence and the result is the same everywhere.

The partial order can be negotiated by passing explicit or implicit synchronization tokens to the distributed components, waiting until some consensus is reached for global transactions, or completely localizing authority over a resource (as in BASE), so that the order of events that affects a piece of data is determined by the node responsible for that resource.

[1] Under the default isolation level, if you issue a SELECT statement and use the results to create a separate UPDATE statement in the same transaction, MySQL will simply overwrite any updates made by other transactions in the meanwhile. The SELECT and the UPDATE are not a single atomic operation to outside observers. Of course, this isn't really a problem since everyone uses SELECT FOR UPDATE, right?
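The lost update can be illustrated with a toy interleaving (no real database involved; both "transactions" read before either writes):

```python
balance = {"acct": 100}

def read_then_compute(amount):
    seen = balance["acct"]     # the SELECT
    return seen + amount       # new value derived from a possibly stale read

# Two concurrent deposits interleave: both read 100 before either writes.
a = read_then_compute(10)
b = read_then_compute(20)
balance["acct"] = a            # transaction A's UPDATE
balance["acct"] = b            # transaction B's UPDATE silently clobbers A's
# The balance is now 120 instead of 130: A's deposit is lost.
```

SELECT FOR UPDATE fixes this by making the second reader block until the first transaction commits, so it computes from the fresh value.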

[2] Safe updates can be done by creating a new document for every update operation, essentially treating CouchDB as a message queue, and keeping track of which updates have been applied in the document they apply to. If you don't need auditing, handled updates can then be deleted, and the list of applied update IDs can be garbage collected. These operations are possible to implement with single document atomicity guarantees.

[3] Ironically I can't actually find any documentation about the guarantees that SimpleDB makes. I've heard vector clocks being mentioned; they can be used to detect conflicts post factum, but they can't fix them, so I'm not sure how a writer would be alerted of conflicts found during replication. Anybody got a clue?


Unknown said...

I think you're mis-characterizing isolation levels a bit. The serializable level is typically considered safer than repeatable read because it behaves as if the transactions had all occurred serially. It prevents phantom reads.

I'm not following why it would cause a transaction with a SELECT and UPDATE to "overwrite" data. It should block other transactions until commit, and possibly abort other transactions that wait too long, but never overwrite committed data.

Most people are better off running at repeatable read, which is less safe, but is more obvious to developers, e.g. you don't need to initiate a new transaction in order to see data committed by other connections.

nothingmuch said...

Well, the problem is that mysql defaults to read committed isolation level.

I think if someone understands repeatable reads then they should feel free to use them, but the distinction is very tricky to establish.

If you are using two select queries to set up relationship data in a single transaction, that could still be a problem.

This of course is *extremely* unlikely even with high concurrency and somewhat longish transactions, but if it does actually happen that can be very confusing.

Personally almost all of my selects for update are done using an explicit primary key, so to someone like me that won't be a difference at all, but conceptually I find it easier to reason about the serializable isolation level and it has never been a performance problem.

nothingmuch said...

I also think that the issue you raised with needing to start a new transaction is less of a concern if transactions are generally short lived, which is usually the case in web apps.

Unknown said...

Actually, we both remembered it wrong: MySQL defaults to repeatable read, and I meant to say that it's better to run at read committed. That's the default for Oracle and Postgres.

nothingmuch said...

Hmm, I can see that in the docs too, but the KiokuDB test suite seems to think otherwise.

It passes on BDB and SQLite but not with MySQL 5.0 as configured out of the box.

There are no fetches except for primary key fetches. I'm going to look into this more some time today.

To reproduce give KiokuDB::Backend::DBI's t/fixtures.t a MySQL DSN and set the env var KIOKUDB_STRESS_TEST=1. Also fails with Pg, but not SQLite.

KIOKUDB_STRESS_TEST enables a concurrency test that ensures even balances at the end.

The test only uses primary key fetches so it should work with repeatable reads or serializable isolation.

Hercynium said...

I have an interest in scalable distributed transaction processing systems and I recently read this article:

I have no idea whether or not the techniques described in the article are available in any FOSS software but I especially find it interesting that it declares that commit ordering of transactions across heterogeneous databases is essentially a solved problem.

Clearly, that is not the type of problem being described in your post here, but my point is this: If consistency can be guaranteed in the more complex, general case, it should be somewhat less difficult to implement in a homogeneous cluster like what CouchDB supports.

It seems to me that CouchDB wants to invent its own solution, or perhaps simply does not plan on supporting the type of consistency that would be desired from more traditional databases.

nothingmuch said...

Click on the BASE link; that's an article written by eBay's DBA, and it has nothing to do with CouchDB. I went into a little detail about CouchDB because I think it's far less understood than traditional DBs.

CO ensures that contention is dealt with simply, because the conflicting resources are accessed according to some known precedence.

The order in which transactions are actually committed is not necessarily the order clients asked to commit them. By determining an order for resources, contention can be reduced and more transactions can be committed concurrently. This simplifies consistency checking in such a way that transactions never block when accessing data (like MVCC, as opposed to locking a shared resource), while the need to abort transactions is reduced by committing them in an optimal order.

This image explains the difference well. Both transactions want to access the resource x, but under a strict locking mode T2 has to block as it's running (and the lock that blocks it has to be negotiated), causing the total runtime of both transactions to be longer and reducing concurrency.

However, that final commit still requires two phase atomic commit for distributed transactions.

CO is relevant to systems where concurrent transactional throughput is the main concern (while maintaining total consistency), whereas eventual or weak consistency models are designed to address latency. It can increase the total number of transactions per second compared to more primitive global serialization algorithms, especially if transactions span only a subset of the cluster.

However, this still requires negotiation of the final commit between all involved replicas, something that BASE (if done right) does not.

The easiest way to ensure serializability of transactions is a single global lock (or actually just strict two phase locking). If two transactions touch different data they will never conflict with each other and can be run at the same time. CO ensures correctness even if all data access operations are nonblocking, and only the final commit is blocking.

Eventual consistency and global commitment ordering are completely orthogonal, and can complement each other in a system where some distributed data needs global atomicity and other distributed data does not.

Siberiano said...

I failed to read up to the place that tells about the very problem of data loss and what's wrong with default.

nothingmuch said...

there's a convenient link to the footnote that explains it =P

nothingmuch said...

I wrote a test script to check MySQL's behavior.

Though the default repeatable read isolation level may indeed return the same data on selects, this has absolutely no implications with respect to detecting conflicts.

The updates are not atomic with the selects (except if FOR UPDATE is added), even if the selects would read from a snapshot.

Unknown said...

I agree: if you're not in serializable isolation level and you don't use SELECT FOR UPDATE, a SELECT will not lock rows to prevent other updates that happen after your select. This allows you to choose when you actually want to lock the rows and when you just want to read them atomically without locks (which is the common case). Using serializable isolation turns all SELECTS into SELECT FOR UPDATE.

However, saying that MySQL is configured for data loss because of this is misleading because Oracle and Postgres (and possibly many others) are also configured this way by default.

nothingmuch said...

You're right.

The first sentence should probably read "Most RDBMSs" or something like that, but MySQL is easy prey.

My point is not so much that MySQL is broken (I have other, better reasons like the fact that it silently truncates BLOB columns, for instance).

The problem is that people use their DBs incorrectly; even if they are careful to use transactions (which unfortunately is often not the case), most web app code is still potentially sensitive to race conditions.

I've yet to see one instance of FOR UPDATE or enabling of serializable isolation in real code I've had to maintain.

People also don't retry failed transactions; they rarely even check for commit failures. At least if an error is generated it becomes a simple error page, instead of a problem in the DB.

The good news is that web apps are not very sensitive to such race conditions; they are pretty rare in practice, but with Ajax techniques it's quite possible to accidentally trip them and end up with non-obvious corruption.