Continuing with the topic of purpose-built databases on AWS, this time I’ll be diving into the wonderful world of document stores. For a while now, MongoDB has been the gold standard for document databases. Lately, however, I have come to see AWS DocumentDB as a solid alternative to MongoDB as a document store.
And that is one of the reasons I am focusing on AWS and DocumentDB in this post: it’s an actual purpose-built document store, rather than a multi-model database such as CosmosDB. CosmosDB offers a wide variety of APIs (Document, Graph, Column and Key-Value), making it a multi-purpose database. The reason I am not covering DynamoDB is that migrations from MongoDB to DocumentDB are much easier than migrations to DynamoDB.
Before reading further, I’d recommend that you check out the first post in this series titled: AWS Purpose-Built Databases, Part 1 of 3. If you already did that, read on.
What are purpose-built databases?
I have already covered some of the most common purpose-built databases for Document, Graph, Time series, Key-Value, and so on in my first post, so I will not repeat the listing here. In summary, these are databases designed for a singular purpose (or in some cases a dual purpose), and they tend to perform rather well in that role.
AWS DocumentDB is a fully managed, scalable and, this part is important, MongoDB-compatible database service. Naturally, all the common Platform-as-a-Service (PaaS) benefits apply to it, including out-of-the-box high availability, automated backups, security, and so on. One thing to be mindful of is that DocumentDB doesn’t provide a public endpoint; clusters are by default only accessible from within the VPC. If you really, really need to access a cluster from outside, you can set up SSH tunneling for it.
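As a rough illustration of that last point, tunneling to a DocumentDB cluster typically goes through a bastion host inside the VPC. The sketch below only builds the `ssh -L` command line; the bastion address, key path, and cluster endpoint are hypothetical placeholders, not real resources.

```python
# Sketch: building an SSH port-forwarding command to reach a DocumentDB
# cluster through a bastion host in the same VPC. All hostnames and the
# key file path are made-up examples.
def build_tunnel_command(bastion: str, cluster_endpoint: str,
                         local_port: int = 27017,
                         key_file: str = "~/.ssh/bastion.pem") -> list[str]:
    """Forward local_port on this machine to the cluster's port 27017."""
    return [
        "ssh", "-i", key_file, "-N",
        "-L", f"{local_port}:{cluster_endpoint}:27017",
        bastion,
    ]

cmd = build_tunnel_command(
    "ec2-user@bastion.example.com",
    "mycluster.cluster-abc123.eu-west-1.docdb.amazonaws.com",
)
# Running this with subprocess.Popen(cmd) would open the tunnel; here we
# only print the command for inspection.
print(" ".join(cmd))
```

After the tunnel is up, clients connect to `localhost:27017` as if the cluster were local.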
Compatibility with MongoDB is a little shy of 100%, but AWS claims that the majority (note, not all) of applications that use MongoDB can use DocumentDB without any application changes. When it comes to the cost of running it, compute (plus memory) and storage for DocumentDB can be scaled separately, and storage can scale as high as 64TB. The initial cost seems higher than it would be for DynamoDB, but not so considerably that I would pick DynamoDB over DocumentDB as a target for a MongoDB migration.
DocumentDB can provide very high levels of availability, depending on the number of instances and Availability Zones. As for the running costs I mentioned, billing is consumption-based and metered per second. When it comes to cost control, DocumentDB clusters can be stopped, which can be a good way to save on the running costs of development and testing environments.
A DocumentDB cluster can have up to 16 instances running, and its data is replicated across three Availability Zones. From the compute perspective, instances can be sized from 2 vCPUs all the way up to 96 vCPUs, with 4GB to 768GB of memory. All writes to the storage layer go through the primary instance, while replica instances can be used for reading. The diagram below provides a high-level view of the architecture.
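That write-through-primary, read-from-replicas split is usually expressed in the connection string. A minimal sketch, assuming a placeholder cluster endpoint and user; the option names are standard MongoDB driver options (DocumentDB also requires `retryWrites=false`, since retryable writes are not supported):

```python
# Sketch: a DocumentDB connection string that routes reads to replicas.
# The user and cluster endpoint below are hypothetical.
def docdb_uri(user: str, host: str, replica_set: str = "rs0") -> str:
    # readPreference=secondaryPreferred sends reads to replica instances
    # when available; writes always go through the primary.
    return (
        f"mongodb://{user}@{host}:27017/"
        f"?tls=true&replicaSet={replica_set}"
        "&readPreference=secondaryPreferred&retryWrites=false"
    )

uri = docdb_uri("appuser",
                "mycluster.cluster-abc123.eu-west-1.docdb.amazonaws.com")
print(uri)
```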
Why go with DocumentDB?
The main reason people use document stores is that they offer schema flexibility. If that is a requirement you have, there aren’t many good counter-arguments to the fact that a database purpose-built for storing JSON documents is the appropriate one for you. If it is also architected for the cloud, coming with all the benefits of PaaS and a consumption-based cost model, it gets even more difficult to find alternatives.
Migrating from MongoDB to DocumentDB
A migration from MongoDB to DocumentDB typically consists of four phases.
- Workload Discovery
- Migration Planning
- Migration Testing
- Migration Execution
Next, we’ll cover some details on all of these phases.
The workload discovery phase can be split into smaller units of work, such as verifying compatibility, getting the inputs for correctly sizing the target environment, and understanding the data management characteristics.
For reviewing the compatibility, there are some practices, documentation, and tooling provided by AWS.
- Check the source environment versions, including the drivers
- Go through the developer documentation in the DocumentDB Developer Guide
- Run the Amazon DocumentDB Compatibility and Index tools
Understanding the performance characteristics is also important and provides input to your planning phase, so you will want to look at the following.
- Average data size, data growth, and backup size
- Averages and peaks for read and write operations
- Percentage of time at read and write peaks
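Those three metrics feed directly into sizing. A back-of-the-envelope sketch of how they might be combined; every number here is made up for illustration, and the 30% headroom factor is my own assumption, not an AWS recommendation:

```python
# Sketch: turning discovery metrics into a rough capacity target.
# All figures below are illustrative, not from a real workload.
avg_reads, peak_reads = 1_200, 4_500   # operations per second
avg_writes, peak_writes = 300, 1_100
pct_time_at_peak = 0.05                # ~5% of the day at peak load

headroom = 1.3  # assumed: size for 30% above the observed peak
required_read_capacity = round(peak_reads * headroom)
required_write_capacity = round(peak_writes * headroom)
print(required_read_capacity, required_write_capacity)
```

If the percentage of time at peak is very low, read replicas that absorb bursts may be cheaper than sizing every instance for the peak.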
And finally, there’s the data management perspective to consider. This revolves around questions related to the following topics.
- RPO and RTO for the system
- Data and backup retention times
- TTL for indexes
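On that last point, TTL behavior is carried by the index definition itself, so it is worth capturing during discovery. A minimal sketch of the shape of a TTL index; the field name and retention period are hypothetical:

```python
# Sketch: the shape of a TTL index definition as you would pass it to a
# MongoDB/DocumentDB driver's create_index. Field name and TTL are examples.
ttl_index = {
    "keys": [("lastAccessed", 1)],        # indexed date field
    "expireAfterSeconds": 7 * 24 * 3600,  # documents expire after 7 days
}
# With pymongo this would be roughly:
#   collection.create_index(ttl_index["keys"],
#                           expireAfterSeconds=ttl_index["expireAfterSeconds"])
print(ttl_index["expireAfterSeconds"])
```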
In the planning phase, you’re planning for both the target environment and the migration approach. Using the inputs from workload discovery, you’ll need to consider clustering requirements and which migration tools to use. For target environment sizing, AWS provides a sizing tool that utilizes mongostat output for this purpose. For the cluster’s high availability and performance, there are some things to consider.
- Availability requirement (how many 9’s you need)
- What type of multi-AZ configuration is needed
- Use of read replicas
- vCPU counts, connection limits
- Reads and writes per second
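The “how many 9’s” question becomes concrete once you translate it into allowed downtime. A small sketch of that arithmetic:

```python
# Sketch: translating an availability target ("how many 9's") into
# allowed downtime per year, to anchor the multi-AZ discussion.
def downtime_minutes_per_year(nines: int) -> float:
    availability = 1 - 10 ** -nines   # e.g. 3 nines -> 0.999
    return (1 - availability) * 365 * 24 * 60

for n in (2, 3, 4):
    print(f"{n} nines: ~{downtime_minutes_per_year(n):.1f} min/year")
```

Three nines allows roughly 8.8 hours of downtime a year, four nines under an hour; the gap between those two usually decides whether you need replicas spread across multiple AZs.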
MongoDB to DocumentDB migration supports three different approaches: offline, online, and hybrid.
When it comes to selecting a method, it almost always comes down to the downtime the system can tolerate. If you can take the system offline for an extended period of time, you can go with native tools to keep things as simple as possible: stop incoming traffic to the MongoDB source, then use mongodump and mongorestore to move the indexes and the data.
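A minimal sketch of what that offline path looks like, expressed as the command lines a wrapper script might assemble. The hostnames and dump directory are placeholders, and a real run would add credentials and a CA file for TLS:

```python
# Sketch: the offline migration path with native tools. This only builds
# the command lines; endpoints and paths are hypothetical.
dump_cmd = [
    "mongodump",
    "--host", "mongodb-source.example.com",
    "--out", "/tmp/migration-dump",
]
restore_cmd = [
    "mongorestore",
    "--host", "mycluster.cluster-abc123.eu-west-1.docdb.amazonaws.com",
    "--dir", "/tmp/migration-dump",
]
# subprocess.run(dump_cmd, check=True) followed by restore_cmd would
# execute the migration; printed here for review instead.
print(" ".join(dump_cmd))
print(" ".join(restore_cmd))
```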
The online approach is based on AWS Database Migration Service (DMS). It starts with a full load of the data to the target, then enables Change Data Capture (CDC) to replicate changes. It allows the closest-to-zero-downtime migration, shifting the work that needs to be done left, into preparation.
The hybrid approach, as you probably already guessed, combines the two previous approaches. It is commonly recommended when your data size goes above 1TB; anything below that should be easy enough to move with either the online or the offline approach. In the hybrid approach, you first dump the indexes and data and restore them to DocumentDB. During this time, your applications continue to write to the MongoDB source system. These changes are captured using AWS DMS, which replicates them to the DocumentDB cluster you previously set up from the index and data dumps. Once you hit the maintenance window in which the migration was planned, you simply point your applications from MongoDB to DocumentDB, and you’re pretty much done.
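The key DMS setting in the hybrid approach is the migration type: since the full load already happened via mongodump/mongorestore, the DMS task only replicates changes. A sketch of the relevant fragment of a task definition; the ARNs are placeholders and this is far from a complete configuration:

```python
# Sketch: the relevant fragment of an AWS DMS replication task for the
# hybrid approach. Endpoint ARNs are hypothetical placeholders.
dms_task = {
    # "cdc" = replicate ongoing changes only; the initial load was
    # done with mongodump/mongorestore.
    "MigrationType": "cdc",
    "SourceEndpointArn": "arn:aws:dms:eu-west-1:123456789012:endpoint:source",
    "TargetEndpointArn": "arn:aws:dms:eu-west-1:123456789012:endpoint:target",
}
print(dms_task["MigrationType"])
```

With the pure online approach, the same task would instead use `"full-load-and-cdc"`, letting DMS do the initial copy as well.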
Testing and planning for it
While it’s obvious to everyone why you need to test this type of migration, I do want to say something about planning for it. To successfully complete the migration testing, your testing plan should include:
- Defined goals
- Defined success criteria (Go or No-go decision)
- Defined cutover process (ownership of the migration steps)
After you perform the testing, you’ll also have valuable data points to compare against your defined goals. If, for example, your goal was to perform the migration in under 1 hour, and it takes 3, does that lead to a go or a no-go decision?
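Encoding that criterion up front keeps the decision mechanical rather than emotional. A trivial sketch, using the one-hour goal from the example above:

```python
# Sketch: turning a measured migration duration into the go/no-go
# decision defined in the testing plan.
def cutover_decision(goal_minutes: int, measured_minutes: int) -> str:
    return "go" if measured_minutes <= goal_minutes else "no-go"

print(cutover_decision(60, 180))  # goal was 1 hour, the test run took 3
```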
The testing phase of the migration is where you should be spending most of the project’s time. During testing you will typically learn quite a bit about the actual migration process and about potential blockers. To get the most out of the testing, you should utilize all the available tools, both from MongoDB and from AWS. Set up CloudWatch for collecting logs and metrics, time all the steps in the plan, and most importantly, verify that you didn’t miss any steps.
And finally, I would definitely run performance and data-correctness testing for the migrated workload. When changing database engines, there can be some nasty surprises that will not surface without proper testing.
Executing the migration
With all the testing done, you should have enough confidence in the selected migration approach (offline, online, or hybrid) to perform the actual migration. The better your testing has been, the better this phase will go. In an ideal situation, the migration is just a repetition of the migration testing done earlier.
One phase of the migration that often gets overlooked is post-migration optimization. Even when you have done the workload analysis with care, workloads that land in the cloud run on quite different (and typically shared) platforms. Having proper monitoring in place allows you to quickly validate the sizing and performance characteristics of the workloads.
Wrapping it up
Thanks for reading! I hope you found this post on AWS purpose-built databases useful. If you did, I recommend reading my other posts on the topic.