CouchDB is lots of fun. It's really easy to install on a Mac using the CouchDBX package. It comes with a nice web UI so you can play around with it straight away. It leverages REST and JSON to provide a simple API that you can use from virtually any language. It has a great transactional model which gives you full ACID semantics in a very lightweight way. So why don't I use it? Well, for several reasons. I'll try to skip the standard flaming I've heard on the 'tubes before. Here goes…
Views

Unfortunately, views can only be computed from the original documents; there is no way to create a view whose input is another view. This means that you cannot do anything really interesting with values from multiple documents. You can aggregate data from several documents into buckets using the reduce functionality, but you can't process that data any further.
This means you have to live with the same limitations as SQL queries (they are non-recursive, so they can't express transitive relationships), but without the freedom to write ad hoc queries and have them execute efficiently (ad hoc views are supported, but there are no general-purpose indexes).
The reduce functionality alleviates this somewhat, but personally I feel it's a bit of a kludge (reduce is really just a special case of map: map takes data and outputs it into buckets using a key, while reduce is a map whose input is the buckets produced by the previous pass).
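To make the shape of the limitation concrete, here is a minimal sketch of such a view against CouchDB's actual HTTP API, using Python's requests library; the database name and the document fields (tag, amount) are hypothetical. The reduced rows it produces are the end of the pipeline: they cannot serve as input to a second view.

```python
import requests

COUCH = "http://localhost:5984"
DB = "example"  # hypothetical database name

design = {
    "_id": "_design/totals",
    "views": {
        "by_tag": {
            # map: emit each document into a bucket keyed by its tag
            "map": "function(doc) { emit(doc.tag, doc.amount); }",
            # reduce: sum the values that landed in each bucket
            "reduce": "function(keys, values, rereduce) { return sum(values); }",
        }
    },
}
requests.put(f"{COUCH}/{DB}/_design/totals", json=design)

# group=true yields one reduced row per key -- and that's as far as the
# pipeline goes; these rows cannot be fed into another map/reduce pass.
rows = requests.get(f"{COUCH}/{DB}/_design/totals/_view/by_tag",
                    params={"group": "true"}).json()["rows"]
print(rows)
```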
Replication

The replication subsystem is also heavily hyped, but it's hard to find details about how it actually works. My understanding is that each conflicting version is kept in storage, but that one of them "wins" and becomes the default version of the document. This is rationalized in the CouchDB technical overview as follows:
The CouchDB storage system treats edit conflicts as a common state, not an exceptional one
If I understand correctly, since a conflict is not an error, without explicitly seeking out these conflicts you keep working with the "winner". From the user's point of view, if the application is not defensive about conflicts but the user decides to deploy it with replication, this could lead to apparent data loss (the data is still there, just not visible in the application) and to inconsistencies (the "winners" of two different documents may embody conflicting assumptions about the state of the database without conflicting in the data itself, though with fully serializable transactions this might not be an issue).
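Being "defensive about conflicts" means asking for them explicitly, since CouchDB won't volunteer losing revisions. A minimal sketch of what that looks like, with hypothetical database and document names:

```python
import requests

COUCH = "http://localhost:5984"
DB = "example"          # hypothetical database name
DOC_ID = "some-doc-id"  # hypothetical document ID

# The body returned here is the arbitrarily chosen "winner"; losing
# revisions only show up if conflicts=true is passed explicitly.
doc = requests.get(f"{COUCH}/{DB}/{DOC_ID}",
                   params={"conflicts": "true"}).json()

for rev in doc.get("_conflicts", []):
    # each losing revision must be fetched (and resolved) by hand
    loser = requests.get(f"{COUCH}/{DB}/{DOC_ID}",
                         params={"rev": rev}).json()
    print("conflicting revision:", rev, loser)
```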
In short, color me skeptical. The replication subsystem could be a useful start to building distributed apps, but there is still a lot of effort involved in doing something like that.
Out-of-the-box replication support is useful for taking data sets home on your laptop as a developer and being able to push changes back later. I see no compelling evidence for the claims about scalability and clustering.
To me this seems like a niche feature, not really relevant for most applications, but one in which significant effort was invested. The presence of a feature I don't quite care for doesn't really mean I shouldn't use something, but for a project which is still under heavy development this comes at the expense of more important features.
Performance

If I recall correctly, CouchDB supports upwards of 2000 HTTP requests per second on commodity hardware, but this is only optimal if you have many concurrent dumb clients, whereas most web applications scale rather differently (a handful of server-side workers, not thousands of clients).
Even if you use non-blocking clients, the latency of creating a socket, connecting, requesting the data, and waiting for it is very high. In KiokuDB's benchmarks CouchDB is the slowest backend by far, bested even by the naive plain-file backend by a factor of about 2-3, and by the more standard backends (Berkeley DB, DBI) by a factor of more than 10. To me this means that when using KiokuDB with the Berkeley DB backend I don't need to think twice about a request that will fetch several thousand objects, but if that request takes 5 seconds instead of half a second the app becomes unusable. Part of the joy of working with non-linear schemas is that you can do more interesting things with tree and graph traversals, but performance must be acceptable. Not all requests need to fetch that many objects, but for the ones that do CouchDB is limiting.
If you have data dependencies, that is, you fetch documents based on data you found in other documents, this can quickly become a bottleneck. If bulk fetching and view cascades were supported, a view that provides the transitive closure of all relevant data for a given document could be implemented by simply moving the graph traversal to the server side, instead of doing it on the client as sketched below.
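To illustrate the bottleneck, here is a sketch of the traversal as a client must do it today, one HTTP round trip per document. The links field as a way of referencing other documents is a hypothetical convention, as are the names.

```python
import requests

COUCH = "http://localhost:5984"
DB = "example"  # hypothetical database name

def fetch_closure(root_id):
    """Fetch a document and everything reachable from it, one GET per doc."""
    seen, queue, docs = set(), [root_id], []
    while queue:
        doc_id = queue.pop()
        if doc_id in seen:
            continue
        seen.add(doc_id)
        doc = requests.get(f"{COUCH}/{DB}/{doc_id}").json()
        docs.append(doc)
        # every reference discovered here costs another full round trip
        queue.extend(doc.get("links", []))
    return docs
```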
So even though CouchDB performs quite well when measuring throughput, it's quite hard to get low-latency performance out of it. The simplicity gained by using HTTP and JSON is quickly overshadowed by the difficulties of using non-blocking IO in an event-based or threaded client.
To be fair, a large part of the problem is probably also due to AnyEvent::CouchDB's lack of support for the bulk document API's include_docs feature (is that a recent addition?). KiokuDB's object linker supports bulk fetching of entries, so this has the potential to make performance acceptable for OLTP applications requiring slightly larger transient data sets. Update: this has since been added to AnyEvent::CouchDB. I will rerun my benchmarks and post the results in the comments tomorrow.
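For reference, the bulk document API in question lets a single POST replace the per-document GETs from the previous sketch; the document IDs here are again hypothetical:

```python
import requests

COUCH = "http://localhost:5984"
DB = "example"  # hypothetical database name

keys = ["doc-1", "doc-2", "doc-3"]  # hypothetical document IDs

# one round trip for the whole batch; include_docs=true inlines the
# full document bodies into the result rows
resp = requests.post(f"{COUCH}/{DB}/_all_docs",
                     params={"include_docs": "true"},
                     json={"keys": keys}).json()
docs = [row["doc"] for row in resp["rows"] if "doc" in row]
```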
No authentication or authorization
Authorization support could make a big performance difference for web applications. If the mechanisms to restrict access were in place, the CouchDB backend could be exposed to the browser directly, removing the server-side application code as a bottleneck.
Imagine if the server side could provide the client with some trusted token allowing it to view (and possibly edit) only a restricted set of documents. There is lots of potential in the view subsystem for creating a flexible authorization framework.
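To be clear about what I'm imagining, here is a purely hypothetical sketch of such a token scheme: the application server signs a claim about which documents a client may touch, and a thin proxy in front of CouchDB verifies the signature before forwarding requests. Nothing like this exists in CouchDB itself; every name here is made up.

```python
import hashlib, hmac, json

# hypothetical secret shared between the app server and the proxy
SECRET = b"shared-secret"

def issue_token(user_id, allowed_doc_ids):
    """App server side: sign a claim listing the documents a user may access."""
    claim = json.dumps({"user": user_id, "docs": sorted(allowed_doc_ids)})
    sig = hmac.new(SECRET, claim.encode(), hashlib.sha256).hexdigest()
    return claim, sig

def may_access(claim, sig, requested_doc_id):
    """Proxy side: verify the signature, then check the requested document."""
    expected = hmac.new(SECRET, claim.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    return requested_doc_id in json.loads(claim)["docs"]
```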
LDAP authentication is on the roadmap, but authentication and authorization are really separate features, and there doesn't seem to be any work toward flexible access control yet.
Apparent lack of development focus
I guess I have no business complaining about this since I don't actually contribute code, but it seems to me that the team's focus has been on improving what already exists instead of adding important missing features (or at least features I feel are important). This makes me pessimistic about having any of the issues I've raised resolved.
When I was last on the IRC channel there were discussions of a second rewrite of the on-disk BTree format. Personally I would much rather see feature completeness first. Rewriting the on-disk format will probably not provide performance improvements an order of magnitude better than the current state, so I think it's more than acceptable to let those parts remain suboptimal until the API is finalized, for instance. CouchDB's performance was definitely more than acceptable when I was using it pre-rewrite, so this strikes me as a lack of pragmatism and priorities, especially when the project does have an ambitious roadmap.
The alternatives

We've been using the excellent Berkeley DB as well as SQLite and, unfortunately, MySQL for "document-oriented storage", and all of these work very well. Connectivity support is fairly ubiquitous, and unlike CouchDB's, the APIs are already stable and complete.
Other alternatives worth exploring include MongoDB (which unfortunately lacks transactions), key/value pair databases (lots of these lately, many of them distributed), RDF triplestores, and XML databases.
One alternative I don't really consider viable is Amazon SimpleDB. It exhibits all of the problems that CouchDB has, but also introduces a complete lack of data consistency and a much more complex API. Unless you need massive scaling with very particular data usage patterns (read: not OLTP), SimpleDB doesn't really apply.
I think the most important thing to keep in mind when pursuing schema-free data storage is the "you are not Google" axiom of scaling. People seem to be overly concerned about scalability without first having a successful product to scale. All the above-mentioned technologies will go a long way, both in terms of data sizes and data access rates, and by using a navigational approach to storing your data, sharding can be added very easily, as the sketch below suggests.
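A minimal sketch of why sharding comes cheaply under a navigational model: since every object is fetched by its ID, routing on a hash of that ID is enough to spread data across independent stores. The store names here are hypothetical.

```python
import hashlib

SHARDS = ["store-0", "store-1", "store-2", "store-3"]  # hypothetical backends

def shard_for(object_id: str) -> str:
    """Deterministically route an object ID to one of the shards."""
    digest = hashlib.sha1(object_id.encode()).digest()
    return SHARDS[digest[0] % len(SHARDS)]
```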
Anyway, here's hoping CouchDB eventually matures into something that really makes a difference in the way I work. At the moment, once the store and retrieve abstractions are in place, there's nothing compelling me to use it over any other product, but it does show plenty of promise.