All good things must come to an end, including this three-part blog post series. In this post, we’ll dive into one of the database systems I am not hugely familiar with, Apache Cassandra, and its AWS counterpart, Keyspaces. What is Cassandra, then? It’s an open-source, distributed, wide-column data store that is capable of providing extreme read and write performance for massive datasets, and delivering scalability and high availability by forming a cluster from multiple nodes.
Cassandra clusters are also notoriously difficult to manage, with complex scale out and even more complex rollback operations. There are also a bunch of horror stories about the error-prone restoration mechanism, and patching operations of clusters gone horribly wrong. Moreover, it’s lacking a few things, like encryption support, and you do need to learn a new query language (CQL) to make use of it.
Luckily for us, AWS has something for most of the Cassandra pains. Read on, to learn about AWS Keyspaces.
About purpose-built databases.
In the first part of this blog series, I went through some common examples of purpose-built databases. To avoid repeating myself too much, just follow this link to learn more about them.
What is Keyspaces?
AWS Keyspaces is a managed database offering that is Cassandra compatible. As a fully managed service, it removes the need to patch your servers or distributed database services, which is one of the more complex parts of managing Cassandra. The tables in Keyspaces are also encrypted by default, and replicated to multiple Availability Zones for a 99.99% availability SLA within a region. Keyspaces is also serverless, allowing you to pay only for the resources you consume.
While it does provide a good level of compatibility with Cassandra, in terms of CQL and the development tools that are commonly used, there are a few things that do not exist in Keyspaces, such as some consistency levels. There are also some limitations that you need to be aware of, especially when migrating from Cassandra to Keyspaces. One of the more important ones is that the row size needs to be less than 1 MB in Keyspaces, whereas Cassandra can support row sizes up to 2 GB.
The major difference to Cassandra comes from the fact that Keyspaces is serverless. You won’t be deploying nodes to form a cluster; you are getting storage and tables. And even then, you don’t need to worry about provisioning the storage. You create a table and put data there, and storage magic happens in the background.
A big deal about Cassandra is the high performance it delivers, which leads us to a question. Can Keyspaces keep up with it? The short answer would be yes, but let’s dig a bit deeper to understand why. One considerable benefit Keyspaces has is the on-demand capacity mode. As you can probably guess already, on-demand allows you to create tables that can handle thousands of requests per second without needing to provision for it upfront. The way on-demand works is that it adapts to your actual peak workloads, scaling both up and down as needed.
On-demand is really a nice way to make sure that you always have the performance you need, and it is the default capacity mode when creating new tables. However, it is good to keep in mind that you do pay for all the resources you consume. So leaving Keyspaces in on-demand mode, without monitoring it, is a great way to generate meme-worthy AWS bills. If your workload profiles are more predictable, you can always go for the provisioned option, where you define how many reads and writes per second you’re going to need. And if you need to, you can adjust these rather easily afterwards.
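To get a feel for what "defining how many reads and writes" means in practice, here is a minimal sketch of a capacity estimate. It assumes, based on my reading of the AWS documentation, that one write capacity unit covers one write per second for a row of up to 1 KB, and one read capacity unit covers one LOCAL_QUORUM read per second for a row of up to 4 KB (with LOCAL_ONE reads costing half). The function names and constants are my own, so treat this as a back-of-the-envelope helper and verify the unit sizes against the current documentation before sizing anything real.

```python
import math

# Assumed unit sizes; check them against the current AWS documentation.
WRITE_UNIT_BYTES = 1024      # one write unit: a write of up to 1 KB per second
READ_UNIT_BYTES = 4 * 1024   # one read unit: a LOCAL_QUORUM read of up to 4 KB per second


def write_units(row_bytes: int, writes_per_second: float) -> int:
    """Estimate write capacity units for a steady write rate at a given row size."""
    units_per_write = max(1, math.ceil(row_bytes / WRITE_UNIT_BYTES))
    return math.ceil(units_per_write * writes_per_second)


def read_units(row_bytes: int, reads_per_second: float, local_one: bool = False) -> int:
    """Estimate read capacity units; LOCAL_ONE reads are assumed to cost half."""
    units_per_read = max(1, math.ceil(row_bytes / READ_UNIT_BYTES))
    total = units_per_read * reads_per_second
    if local_one:
        total /= 2
    return math.ceil(total)


if __name__ == "__main__":
    # A 2,500-byte row written 200 times and read 1,000 times per second.
    print(write_units(2500, 200))                   # 600
    print(read_units(2500, 1000))                   # 1000
    print(read_units(2500, 1000, local_one=True))   # 500
```

Note how the row size matters more for writes than for reads here: the 2,500-byte row needs three write units per write, but still fits into a single read unit.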
As mentioned earlier, Keyspaces is a serverless product where you just need to load data into tables, and not worry about anything else. The storage that gets automatically deployed to contain the tables provides nice performance. Keyspaces promises single-digit millisecond latency for reads and writes at any scale.
Keyspaces features and architecture.
Like all the managed database services, Keyspaces comes with a lot of “on-by-default” goodies that you might otherwise struggle to implement. These include, but are not limited to:
- Data-at-rest encryption with AWS KMS
- Point-in-time recovery (PITR) with a 1 second RPO within the last 35 days.
Naturally, you can also make use of CloudWatch for monitoring. Keyspaces is also available across 20 different regions, which is relatively good in my opinion. One thing worthy of a special mention is that AWS provides a Developer Toolkit as a Docker image that anyone can use. It’s a zero-configuration, best-practices-by-default tooling package that can be used for lightweight migration and development activities (and it also works against Cassandra clusters).
As mentioned previously, Keyspaces provides you with an endpoint with distributed compute and storage, running across 3 Availability Zones for high availability and durability. At a high level, the architecture looks something like this.
While this architecture looks much simpler than what I have seen for Cassandra, the additional benefit with Keyspaces is that it’s hosted in AWS datacenters. And AWS does know how to build those rather well.
Migrating from Cassandra to Keyspaces.
Admittedly, I have never worked much with Cassandra, so I don’t have too much first-hand experience on this topic. For more detailed guidance, you can visit the AWS documentation for planning and executing the migrations. At a high level, though, the steps are pretty much the same as in any other database migration:
- Identify and understand access patterns
- Collect sizing requirements
- Data preparation
- Throughput provisioning
- Recognizing built-in limitations, such as the number of connections
As with all cloud services, and PaaS services in particular, there is some amount of resource governance in place. This is important to note, especially when you have bulk workloads. Some limitations will impact sizing and data loading performance, and others might introduce unexpected failures (for example, due to the row size limitation). So in some cases you might need to perform some data preparation steps before you’re able to execute the data migration. And naturally, you need to plan how to recover from any data loading failures.
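As an example of such a data preparation step, the sketch below splits exported rows into those that fit under the 1 MB row size limit and those that need attention before loading. The size calculation is my own rough approximation (UTF-8 length of column names plus values), not the exact accounting Keyspaces uses, and I am assuming 1 MB means 1024 * 1024 bytes. The helper names are hypothetical, so use this for spotting obvious offenders rather than as a precise check.

```python
from typing import Any, Iterable, List, Mapping, Tuple

# Assumed reading of the 1 MB row size limit.
MAX_ROW_BYTES = 1024 * 1024

Row = Mapping[str, Any]


def approx_row_bytes(row: Row) -> int:
    """Rough row size: column names plus values. Not the exact Keyspaces calculation."""
    total = 0
    for name, value in row.items():
        total += len(name.encode("utf-8"))
        if isinstance(value, (bytes, bytearray)):
            total += len(value)
        elif value is not None:
            total += len(str(value).encode("utf-8"))
    return total


def split_by_size(rows: Iterable[Row], limit: int = MAX_ROW_BYTES) -> Tuple[List[Row], List[Row]]:
    """Return (rows under the limit, rows at or over the limit)."""
    ok: List[Row] = []
    too_big: List[Row] = []
    for row in rows:
        (ok if approx_row_bytes(row) < limit else too_big).append(row)
    return ok, too_big


if __name__ == "__main__":
    rows = [
        {"id": 1, "body": "small payload"},
        {"id": 2, "body": b"\0" * (1024 * 1024)},  # deliberately oversized
    ]
    ok, too_big = split_by_size(rows)
    print(len(ok), "rows ready to load;", len(too_big), "rows need splitting or trimming")
```

Running a check like this before the bulk load is a lot nicer than finding the oversized rows one failed insert at a time.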
When calculating throughput requirements, there are some scripts available from AWS that can be used to identify row sizes (max, min, avg) and help with the estimates. But you should still have monitoring in place, and be ready to adjust the provisioning as needed. One thing to note is that, unlike for most other databases, Cassandra to Keyspaces is not at the moment supported by AWS Database Migration Service. In fact, there are only a couple of options available: cqlsh and DataStax Bulk Loader.
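If you just want to see what those row size statistics boil down to, the idea can be sketched in a few lines. This is not the AWS script, only a hypothetical stand-in that assumes you already have a sample of row sizes in bytes.

```python
from statistics import mean
from typing import Dict, Iterable


def summarize_row_sizes(sizes: Iterable[int]) -> Dict[str, float]:
    """Return min, max and average of a sample of row sizes (in bytes)."""
    sample = list(sizes)
    if not sample:
        raise ValueError("need at least one row size")
    return {"min": min(sample), "max": max(sample), "avg": mean(sample)}


if __name__ == "__main__":
    print(summarize_row_sizes([200, 400, 1500]))
```

The average is what you would typically feed into a throughput estimate, while the maximum tells you whether any rows are getting close to the row size limit.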
Wrapping it up.
Alright, that’s it! Hopefully, you enjoyed reading this post, and if you are interested in the previous parts of the series, visit the links below.