Here at Sensor Tower, we run a fairly large infrastructure on AWS. It comes with a significant cost, which is why we are constantly looking for ways to make it more reliable, more performant, and less expensive. For these reasons, it's important to follow announcements from Amazon, and Graviton2 looked exciting. On paper, ARM instances simply perform better at lower cost, so all you'd need to do is migrate your servers to a different CPU architecture. Sounds easy, right? The rest of this article covers the benefits we've gained as well as the various challenges we faced along the way.
The actual migration from x86_64 to ARM is pretty straightforward, but the related software updates to the latest ARM-compatible versions can be as complicated as any major upgrade of critical software.
In a nutshell, we spent a lot of time getting to a stable MongoDB release, saved 30 percent in costs, and didn't get a significant performance improvement.
Our CTO stumbled upon an article about Graviton2 processor benchmarks in a newsletter. That article compares different databases running on Graviton2. TL;DR: MariaDB doesn't get a significant performance boost, but Redis gets a boost of around 27 percent with cost savings of around 20 percent at the same time! We discussed it with our infrastructure team, and after some clarification and a check that Redis runs on the ARM architecture out of the box, we decided to migrate our Redis servers to Graviton2.
By the way, we have around eight Redis instances of different sizes in our infrastructure. The biggest one is an r5.4xlarge with 16 vCPUs and 128 GB of RAM. That is still a small piece of our infrastructure, but it can at least give us some measurable results in terms of cost savings.
The Redis migration process is not very difficult because, first of all, we can tolerate some downtime on that server (less than a minute is totally fine), and second, the instance doesn't contain any complicated custom software. It is basically a raw machine built from the Amazon Linux 2 AMI, with some system libs installed from yum and our Redis rpm package.
Yes, we use our own rpm build of Redis, but only to pin the package installation and the configuration of the systemctl service and the Redis config, so that nothing accidentally breaks after a Redis migration to another version, and also to have that rpm cached in our S3 for cases when the external yum repository is unavailable. Additionally, we build rpm packages on our Jenkins nodes, and because Redis didn't support cross-compilation at the time of writing, we had to spawn a new Jenkins worker running on a Graviton processor and have it compile our Redis and put it into our rpm repo.
Finally, we just spawned a new instance from a fresh AL2 AMI that supports Graviton2, ran the ansible-playbook with the Redis rpm, fixed a few small issues during playbook provisioning, and it was ready to run!
After successful instance provisioning tests, we decided not to drag it out for too long and to migrate one of our production instances. We chose a less critical one and went experimenting!
The Redis migration process is pretty simple (so simple that it can be automated with just 150 lines of Ruby code), but first, I want to share how we use DNS records and IP addresses in our infra.
In a fairly conventional way, we use an ENI (Elastic Network Interface) and point the DNS record of the desired resource at its private IP address.
As you can see, a Route53 DNS record points to our ENI, and the ENI is attached to the instance. When we want to swap the instance behind a given DNS record, we don't need to change the A record and wait out DNS caches while traffic is split between the old and new instances in an uncertain state. Instead, we simply move the ENI attachment from one instance to the other; it happens instantly, and all requests are routed to the new instance immediately.
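To make the idea concrete, here is a minimal Ruby sketch of binding a DNS name to an ENI's private IP with the AWS SDK. The ENI ID, hosted zone ID, and record name are made-up placeholders, and our real setup is managed by our provisioning tooling; this only illustrates the approach.

```ruby
# Hypothetical IDs/names for illustration only
require 'aws-sdk-ec2'
require 'aws-sdk-route53'

ENI_ID         = 'eni-0123456789abcdef0'
HOSTED_ZONE_ID = 'Z0123456789ABCDEFGHIJ'
RECORD_NAME    = 'redis-cache.internal.'

ec2 = Aws::EC2::Client.new
eni = ec2.describe_network_interfaces(network_interface_ids: [ENI_ID])
         .network_interfaces.first

# Point the A record at the ENI's private IP; moving the ENI between instances
# later does not require touching this record at all.
Aws::Route53::Client.new.change_resource_record_sets(
  hosted_zone_id: HOSTED_ZONE_ID,
  change_batch: {
    changes: [{
      action: 'UPSERT',
      resource_record_set: {
        name: RECORD_NAME,
        type: 'A',
        ttl: 300,
        resource_records: [{ value: eni.private_ip_address }]
      }
    }]
  }
)
```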
So, the Redis migration looks like this (a condensed sketch of the automation follows the list):
Provision a new Redis instance on a Graviton (G-series) instance type (a.k.a. New)
Stop the Redis process on the New instance
Do a bgsave on the Old Redis
Rsync the resulting dump from Old -> New
Start the New Redis
Detach the ENI from the Old instance
Attach that ENI to New
Done! You’re awesome!
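For illustration, here is a condensed Ruby sketch of that automation, mirroring the steps above (the real script is those ~150 lines with far more error handling). The hostnames, ENI and instance IDs, and the /var/lib/redis path are placeholders, and it assumes passwordless SSH/rsync access between the hosts.

```ruby
# A minimal sketch of the Redis migration automation; all names/IDs are hypothetical.
require 'aws-sdk-ec2'
require 'redis'

OLD_HOST        = 'redis-old.internal'
NEW_HOST        = 'redis-new.internal'
ENI_ID          = 'eni-0123456789abcdef0'
NEW_INSTANCE_ID = 'i-0123456789abcdef0'

ec2 = Aws::EC2::Client.new

# 1. Stop Redis on the new instance so it doesn't overwrite the copied dump on restart
system("ssh #{NEW_HOST} sudo systemctl stop redis") or abort('failed to stop new Redis')

# 2. Ask the old Redis to dump its dataset to disk and wait until the save finishes
old_redis = Redis.new(host: OLD_HOST)
last_save = old_redis.lastsave
old_redis.bgsave
sleep 1 until old_redis.lastsave > last_save

# 3. Copy the dump from Old to New (assumes rsync/ssh access between the hosts)
system("ssh #{OLD_HOST} sudo rsync -a /var/lib/redis/dump.rdb #{NEW_HOST}:/var/lib/redis/") \
  or abort('rsync failed')

# 4. Start Redis on the new instance; it loads the copied dump on startup
system("ssh #{NEW_HOST} sudo systemctl start redis") or abort('failed to start new Redis')

# 5. Move the ENI: detach from the old instance, attach to the new one
attachment = ec2.describe_network_interfaces(network_interface_ids: [ENI_ID])
                .network_interfaces.first.attachment
ec2.detach_network_interface(attachment_id: attachment.attachment_id)
ec2.wait_until(:network_interface_available, network_interface_ids: [ENI_ID])
ec2.attach_network_interface(network_interface_id: ENI_ID,
                             instance_id: NEW_INSTANCE_ID,
                             device_index: 1) # secondary interface slot, as before

puts 'Done! You are awesome!'
```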
In this way, we migrated all eight of our Redis instances and didn't notice any problems along the way.
Although we were not expecting any improvements in CPU load and were mainly focused on cost reduction, it was still nice to see a minor CPU load decrease across all Redis instances.
This is because our Redis instances are provisioned with enough CPU headroom to handle extreme cases (like an overflowing Sidekiq queue, unexpected cache expiration, etc.). Generally, these instances are under-loaded yet ready to handle peak loads when needed.
The price comparison for the same instances, however, looks much better!
Thus, with just a few days of work from one DevOps engineer, we managed to cut $600 off our monthly bill without any downsides!
Because of this success, we decided to go ahead and try to migrate our MongoDB instances, which make up a substantial part of our infrastructure in terms of cost.
It's worth mentioning that we had been interested in ARM even half a year before our experiments with Redis, but, unfortunately, MongoDB (as of July 2020) didn't support ARM64 on Amazon Linux 2 (it supported only Ubuntu 18.04). We were, however, inspired by Percona's research article on MongoDB ARM performance and decided to ask the MongoDB team about the timeline for getting Mongo on ARM64 AL2: our CTO created an issue in their Jira.
Fortunately, one month after the Redis migration and cost/performance comparison, we got an answer from the MongoDB team that they had added support for AL2 ARM64 in Mongo 4.4.4, so we decided to proceed with testing ARM for our Mongo infrastructure.
Before getting into the weeds, a few words about our MongoDB usage here at Sensor Tower. Currently, we have around 30 replica sets running on different instance types, holding from 200 GB to 11 TB of data each. Every ReplicaSet has its Primary and Secondary. The Primary handles all the reads and writes; the Secondary just replicates data and acts as a stand-by continuous backup. In case of a primary node failure, the secondary would be elected as primary and would be able to handle the load for some time while ops engineers come to help.
Finally, both instances are connected to an arbiter mongo process that runs on a different machine and serves only election requests. Every replica set member has its priority, which elections take into account, and when the actual Primary becomes unresponsive, the remaining two nodes have a quorum to promote the secondary to primary.
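As a small illustration of this topology, here is a hedged Ruby sketch that asks a replica set member which node is currently primary, secondary, and arbiter, and estimates the replication lag. The hostname is a placeholder and this is not the exact script we run; it just uses the standard replSetGetStatus command via the Ruby driver.

```ruby
# Illustrative replica set topology check; hostname is hypothetical.
require 'mongo'

client = Mongo::Client.new(
  ['mongo-primary.internal:27017'],
  database: 'admin',
  connect:  :direct
)

status  = client.database.command(replSetGetStatus: 1).documents.first
members = status['members']

primary   = members.find { |m| m['stateStr'] == 'PRIMARY' }
secondary = members.find { |m| m['stateStr'] == 'SECONDARY' }
arbiter   = members.find { |m| m['stateStr'] == 'ARBITER' }

# Rough replication lag: primary's last op time minus the secondary's last applied op time
lag = primary['optimeDate'] - secondary['optimeDate']

puts "PRIMARY:   #{primary['name']}"
puts "SECONDARY: #{secondary['name']} (lag: #{lag.round}s)"
puts "ARBITER:   #{arbiter['name']}"
```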
Our primaries run on NVMe-based Amazon instances, while our secondaries (which usually aren't loaded by anything except replication) are EBS-backed.
We chose NVMe for primaries because we need maximum read/write performance there, and since NVMe instance storage is an SSD physically attached to the machine, it gives us the best performance available.
EBS for the secondary is a good choice because it is quite flexible: it allows us to create snapshots and detach/attach volumes from instance to instance, which is exactly what we need from a backup instance.
An example of the read/write performance we need from a primary DB:
The diagram below visualizes our MongoDB deployment:
Instance storage here means NVMe-backed Amazon instances like r6gd/m6gd/c6gd/im4gn/is4gen.
Obviously, we decided to do it in the safest way possible, i.e., first migrate the secondary and only then proceed with the primary node. If everything looked good for a week or so, we'd proceed with the other replica sets.
17th February 2021 - We tried applying our existing playbook for a mongo secondary, but with an ARM64-based AMI on a Graviton instance. As expected, it didn't go entirely smoothly. Besides some minor package dependency errors, we ran into something else: filebeat and collectd wouldn't start.
Collectd is a daemon that collects system-wide hardware statistics and custom software metrics and pushes them to a warehouse, in our case Graphite. We use Grafana to visualize them.
In our case, the issues weren't about ARM but about the MongoDB upgrade to 4.4.4 and an API used by our customized mongodb_info.py script (based on signalfx/collectd-mongodb), which we use to collect specific metrics like “Replication Lag”, “Replication Active Nodes count”, and “Active Clients connected”. The plugin simply wasn't able to start. We found that the API had changed, fixed it in the mongodb_info.py script, and it worked as well as before, so let's mark this one as minor :)
But the filebeat issue is much more interesting…
Filebeat is one part of the ELK stack developed by elastic.co, which we use for collecting, processing, and visualizing logs. Specifically, Filebeat is installed on the client side to collect logs on the host and send them to a remote Logstash process, which processes them and forwards them to Elasticsearch, which indexes them and makes them available to Kibana, which in turn renders nice plots and graphs for every machine in our infrastructure that ships logs to Logstash via Filebeat :)
The main issue we got stuck on is that we used Filebeat version 5.6 across our infrastructure, and version 5 of the other ELK products accordingly. Surprisingly, Elastic offers official ARM support only in version 7 and above. I tried to solve it in a few ways: in particular, I tried to compile Filebeat from source for the target architecture, but it didn't work right away and looked rather complicated. Then we tried to upgrade only Filebeat without upgrading the rest of the ELK stack, but the version gap was too big and clearly incompatible. Finally, we tried to YOLO-upgrade ELK to the latest major version (without any investigation or a careful upgrade process, but on a different machine restored from a snapshot), which didn't work either :( In the end, we weighed all the pros and cons of having Filebeat on our mongo instances and decided not to fix it for now and to plan a proper fix later, once we knew everything else worked well.
In our case, the main task is to migrate the data, because the instances would be provisioned from scratch anyway. All database data is stored on a separate EBS volume mounted at /mnt/mongodb-data and used only for DB data. So the only thing we need to do is move the data from that volume to another one, and here is how we can achieve that:
Option 1: Rsync the data from the old volume to the new one
Cons:
It requires stopping the mongod process on the source instance so that rsync can copy a finalized, consistent set of data
Downtime is pretty long. Copying 1 TB of data (and that is far from the biggest instance we have) takes at least 2 hours. Remember that we're copying from an EBS volume: at an effective throughput of around 150 MB/s, 1 TB alone is close to two hours of pure copy time
Option 2: Simply create a new secondary instance, add it to the ReplicaSet, wait while it synchronizes with the rest of the ReplicaSet, then remove the previous secondary from the RS and rename the new one
Cons:
The replication process in mongo takes much longer than rsync, because mongo replicates only the data and the indexes have to be rebuilt on the new instance. Our secondary instances are not powerful, so for a database with many indexes the initial sync can take more than 10x longer than rsync.
Pros:
The safest way - no downtime at all!
Option 3: Buut… Mongo data on secondaries is stored on a separate EBS volume, so we can simply detach it from the old instance and attach it to the new one! We just stop the process on the old instance, unmount and detach the volume, attach and mount it on the new one, reattach the ENI to the new instance, and start mongo there!
Cons:
After starting the mongod process on the new instance, it needs some time to start up (it recalculates some caches; no more than 15 minutes in our experience)
Total downtime is about 20 minutes to migrate the volume and ENI and start mongo on the new instance
Pros:
20 minutes of downtime is all it takes to migrate a secondary node (excluding instance provisioning, of course) - pretty fast, don't you think?
Since it is a secondary, we don't need to provide 100% uptime and can accept its partial unavailability. So, in our case, option #3 looks perfect from all sides, and we started executing it with one of our secondaries!
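Here is a minimal Ruby sketch of what option #3 boils down to. The hostnames, volume/ENI/instance IDs, and the assumption that /mnt/mongodb-data has an fstab entry on the new host are all placeholders; our actual playbook has more checks and error handling.

```ruby
# Sketch of the secondary migration (option 3); all names/IDs are hypothetical.
require 'aws-sdk-ec2'

OLD_HOST        = 'mongo-secondary-old.internal'
NEW_HOST        = 'mongo-secondary-new.internal'
VOLUME_ID       = 'vol-0123456789abcdef0'
ENI_ID          = 'eni-0123456789abcdef0'
NEW_INSTANCE_ID = 'i-0123456789abcdef0'

ec2 = Aws::EC2::Client.new

# 1. Stop mongod on the old secondary and unmount the data volume
system("ssh #{OLD_HOST} 'sudo systemctl stop mongod && sudo umount /mnt/mongodb-data'") \
  or abort('failed to stop mongod / unmount')

# 2. Move the EBS data volume to the new instance
ec2.detach_volume(volume_id: VOLUME_ID)
ec2.wait_until(:volume_available, volume_ids: [VOLUME_ID])
ec2.attach_volume(volume_id: VOLUME_ID, instance_id: NEW_INSTANCE_ID, device: '/dev/sdf')
ec2.wait_until(:volume_in_use, volume_ids: [VOLUME_ID])

# 3. Move the ENI so the DNS name now resolves to the new instance
attachment = ec2.describe_network_interfaces(network_interface_ids: [ENI_ID])
                .network_interfaces.first.attachment
ec2.detach_network_interface(attachment_id: attachment.attachment_id)
ec2.wait_until(:network_interface_available, network_interface_ids: [ENI_ID])
ec2.attach_network_interface(network_interface_id: ENI_ID,
                             instance_id: NEW_INSTANCE_ID,
                             device_index: 1)

# 4. Mount the volume (assumes an fstab entry) and start mongod; startup may take ~15 min
system("ssh #{NEW_HOST} 'sudo mount /mnt/mongodb-data && sudo systemctl start mongod'") \
  or abort('failed to mount / start mongod')
```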
The migration went pretty well, and we kept the old instance around for a few days just to be extra sure that everything went as expected.
What results did we notice after a few days of observation?
As you can see in the image (the red vertical line marks the switch time), the overall CPU load decreased, and the spikes became smaller.
With these encouraging results, we decided to migrate our first Primary instance.
Actually, primary migration is significantly more complicated for the following reasons:
Primary instances serve all the read/write load from our background jobs and website
These instances run on NVMe machines whose SSDs are physically bound to the host and cannot be detached from the instance (unlike EBS).
As a result, we have three possible options for primary migration:
Option 1: Spawn a new primary, add it to the ReplicaSet, let MongoDB's native replication resync it from scratch, then make it the main primary and delete the old one
Pros:
The safest option, because the ReplicaSet is fully operational at any point during the migration and only a small downtime is required at the moment of the switch (technically it is not even downtime, just a global connection interruption for all clients at the moment of the primary switch)
Cons:
ReplicaSet synchronization takes a long time, as mentioned earlier when we described the migration of the secondary instance
Option 2: Spawn a new secondary with an EBS volume restored from the latest snapshot of the existing secondary, add it to the ReplicaSet, and let it synchronize (since the snapshot is fresh, synchronization time would be minimal, because MongoDB only needs to replicate the delta accumulated since the snapshot was created). After that, spawn a new primary, stop the mongod process on the newly created secondary, and rsync the data from it to the new primary
Pros:
Rsync takes a predictable amount of time and is pretty fast, as I mentioned before. Also, the ReplicaSet stays fully operational and safe to use during the migration
Cons:
It requires quite a lot of automation and attention from DevOps engineers during the migration. Plus, we'd need to define new cleanup procedures because many extra resources are involved, e.g. an additional secondary and snapshots. The process looks overcomplicated.
Option 3: Do everything as described in the previous point, BUT use the existing secondary. So, just rsync the data from the secondary to the new primary, change the ENI, and start the new primary
Pros:
Fastest and simplest option
Cons:
Requires stopping the mongod process on the secondary for some time (while the rsync is in progress). This is actually not much of a problem, because most of the time the load on the instances is predictable and the secondary is used only for backup purposes. In fact, the cases where the primary fails and the secondary has to step up and handle production load are so rare that we assessed this risk as acceptable and lower than the risk of inventing and running a more complicated migration procedure.
After a long discussion and testing of every option, we decided to go ahead with option 3 because it is the easiest to automate and the fastest in terms of time spent. The pros definitely outweigh the cons.
With that settled, we started migrating primaries.
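For illustration, a condensed Ruby sketch of what option 3 looks like for a primary. Hostnames, IDs, and paths are made-up placeholders, and the sketch deliberately leaves out replica set health checks as well as the handling of writes that land on the old primary during the copy window.

```ruby
# Sketch of the primary migration (option 3); all names/IDs are hypothetical.
require 'aws-sdk-ec2'

SECONDARY       = 'mongo-secondary.internal'
OLD_PRIMARY     = 'mongo-primary-old.internal'
NEW_PRIMARY     = 'mongo-primary-new.internal' # Graviton instance with NVMe storage
ENI_ID          = 'eni-0123456789abcdef0'
NEW_INSTANCE_ID = 'i-0123456789abcdef0'

# 1. Stop mongod on the secondary so its data directory is consistent,
#    then rsync it onto the new primary's NVMe mount
system("ssh #{SECONDARY} sudo systemctl stop mongod") or abort('stop secondary failed')
system("ssh #{SECONDARY} sudo rsync -a /mnt/mongodb-data/ #{NEW_PRIMARY}:/mnt/mongodb-data/") \
  or abort('rsync failed')

# 2. Stop the old primary and move the ENI so its DNS name now points to the new box
system("ssh #{OLD_PRIMARY} sudo systemctl stop mongod") or abort('stop old primary failed')
ec2 = Aws::EC2::Client.new
attachment = ec2.describe_network_interfaces(network_interface_ids: [ENI_ID])
                .network_interfaces.first.attachment
ec2.detach_network_interface(attachment_id: attachment.attachment_id)
ec2.wait_until(:network_interface_available, network_interface_ids: [ENI_ID])
ec2.attach_network_interface(network_interface_id: ENI_ID,
                             instance_id: NEW_INSTANCE_ID, device_index: 1)

# 3. Start mongod on the new primary and bring the secondary back into the replica set
system("ssh #{NEW_PRIMARY} sudo systemctl start mongod") or abort('start new primary failed')
system("ssh #{SECONDARY} sudo systemctl start mongod")
```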
After migrating one of our heavily loaded mongo primaries, we asked one of our Data Science team members to measure the performance of an existing background job, for which we had a performance report going back to when it still ran on an x86 instance. The results were very good (in the image below, the X axis represents the hour, the Y axis the number of completed jobs per hour).
As you can see, performance improved by roughly 1.3x-3x, which was awesome!
But unfortunately we didn't have much time to celebrate, because 3 weeks after migration our precious Primary instances started to regularly fail for unknown reasons.
This is not the last stop in our journey. In the next post, we'll talk in more detail about all the challenges of the mongo migration specifically. If you are curious about the nitty-gritty details, stay tuned for part 2!