Some notes from browsing through the source:
- It uses hinted handoff and bootstrapping, just like Amazon's Dynamo.
- Its consistency model doesn't seem to be quite the same: Dynamo uses vector clocks to determine causal relationships between versions, whereas Cassandra seems to rely on timestamps alone, with "majority rules" semantics when timestamps are tied.
- Membership is communicated by a gossip protocol as described in the Dynamo paper.
- Requests are made to the system by sending Thrift calls to any node. The Thrift interface definition is included in the source.
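
To make the difference between the two consistency models concrete, here is a rough Python sketch. It is toy code written for illustration, not taken from the Cassandra source or the Dynamo paper; the names `vc_compare` and `resolve_by_timestamp` are made up, and the tie-breaking rule is only my reading of the "majority rules" behaviour described above.

```python
from collections import Counter


def vc_compare(a, b):
    """Compare two vector clocks given as dicts of node name -> counter.

    Returns "before", "after", "equal", or "concurrent". A "concurrent"
    result means neither version causally descends from the other, so a
    Dynamo-style store would keep both and let the client reconcile.
    """
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


def resolve_by_timestamp(replies):
    """Pick a winner from replica replies given as (timestamp, value) pairs.

    Assumed rule: the highest timestamp wins, and if several replies share
    that timestamp, the value reported by the most replicas wins.
    """
    newest = max(ts for ts, _ in replies)
    candidates = [value for ts, value in replies if ts == newest]
    return Counter(candidates).most_common(1)[0][0]
```

The practical difference: `vc_compare({"a": 2, "b": 1}, {"a": 1, "b": 2})` reports the two versions as concurrent, so a conflict is detected, while the timestamp approach always picks a single winner and never surfaces a conflict at all.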
Some further thoughts:
- It doesn't seem like there's a lock on the table during bootstrapping. What happens to mutations made on the source node while it is bootstrapping the destination? Are they marked for later hinted handoff?
- Would system performance be improved by using the new Thrift TNonblockingServer (see THRIFT-5 on JIRA)? It should be more scalable than the TThreadPoolServer they're using now.
- Cassandra is around 40K lines of Java. How many lines would an equivalent Erlang program be, and what would be the performance difference?
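
On the bootstrapping question, one possible answer is that writes arriving during the transfer get buffered as hints and replayed afterwards. The sketch below shows that idea in Python. It is purely hypothetical: the `Node` class, the `ready` flag, and the `replicate` and `deliver_hints` functions are my own inventions for illustration, and I have not confirmed that Cassandra does anything like this.

```python
class Node:
    """A toy storage node holding a key/value table and pending hints."""

    def __init__(self, name):
        self.name = name
        self.ready = True   # False while the node is still being bootstrapped
        self.data = {}
        self.hints = {}     # target node name -> list of (key, value)

    def write(self, key, value):
        self.data[key] = value


def replicate(source, target, key, value):
    """Forward a mutation to target, or keep a hint if it is not ready."""
    if target.ready:
        target.write(key, value)
    else:
        source.hints.setdefault(target.name, []).append((key, value))


def deliver_hints(source, target):
    """Replay buffered mutations, in arrival order, once target is ready."""
    for key, value in source.hints.pop(target.name, []):
        target.write(key, value)
```

In this model the source node never needs a table lock: a mutation that arrives mid-bootstrap is applied locally as usual, recorded as a hint for the destination, and handed off once the destination reports ready.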
All in all, it's a very interesting project that is sure to attract a lot of attention. Now that Powerset has been acquired by Microsoft, I'm a little worried for HBase's future, since two of the three main developers are Powerset employees. Maybe Cassandra can help fill the open source scalable database niche.