
SENSOR TOWER · ENGINEERING · IVAN TAKARLIKOV · MAY 2022

Graviton Journey: MongoDB and Beyond - Part 2

In the second part of our Graviton Journey blog series, we dig into the MongoDB migration challenges we faced and how we dealt with them.


Mongo Migration Challenges

In the previous blog post we talked about the first steps of our migration. In this part we’ll focus solely on the MongoDB migration challenges we ran into.

From now on, the chapters of this tale will be titled with just one thing: a particular MongoDB version number. You'll see why.

4.4.4


After a few mongod processes simply stopped, we spotted a pattern: the problem appeared only on primary instances, and only on those with an uptime of 18 to 20 days or longer.

We also found a MongoDB Jira issue that described the problem and stated that a fix would ship in the upcoming version (4.4.5). It's great to see such a quick reaction from the Mongo team, but we had to do something about our primary instances right away!

We ended up with the following options at hand:

  1. Migrate our databases back to x86 and mongo 4.2

  2. Do a temporary workaround

Downgrades are always more complicated than upgrades (and MongoDB is no exception), but monkey patches are always fun, so we went with option two: a script that runs preventively once a day, goes through all databases, and forcefully restarts any instance with more than seven days of uptime. On top of that, we added an auto-restart option to MongoDB's systemd unit and set up a new Slack notification for whenever a mongod process stops.
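The restart decision itself can be sketched roughly like this (a minimal bash sketch; only the seven-day threshold comes from our setup — fetching uptime via serverStatus() and the cron wiring are assumptions about how such a job would be put together):

```shell
#!/usr/bin/env bash
# Sketch of the preventive-restart check. The seven-day threshold is the one
# from the post; everything else here is illustrative.

UPTIME_LIMIT=$((7 * 24 * 3600))   # seven days, in seconds

# Decide whether an instance needs a preventive restart, given uptime in seconds.
needs_restart() {
  local uptime_seconds=$1
  if [ "$uptime_seconds" -gt "$UPTIME_LIMIT" ]; then
    echo "restart"
  else
    echo "keep"
  fi
}

# In a real cron job the uptime would come from the server itself, e.g.:
#   uptime=$(mongo --quiet --eval 'db.serverStatus().uptime')
#   [ "$(needs_restart "$uptime")" = restart ] && sudo service mongod restart

needs_restart 1814400   # 21 days of uptime
needs_restart 86400     # 1 day of uptime
```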

Luckily, the Mongo team published a 4.4.5 release candidate pretty soon, so we decided to wait for the final release with the above-mentioned workaround in place.

4.4.5

On April 9, the Mongo team released version 4.4.5, and we decided to apply it to all of our mongo instances that had previously been migrated to 4.4.4. In a nutshell, the upgrade procedure is pretty straightforward; I’ll describe it once here:

```
mongo --eval "rs.stepDown()"   # demote the primary to a secondary
sudo service mongod stop
sudo yum upgrade mongodb-org
sudo service mongod start
# wait until the node gets re-elected as primary again
```

During the day, we upgraded all our primaries to the new version and decided to let it sit for a few weeks before proceeding with the rest of the non-migrated instances, to understand how stable version 4.4.5 really is.

But April 16 brought us some more surprises…

One of our instances that historically didn’t have any performance issues started to hit 100 percent CPU usage and became unable to handle the load. Logically, we first checked whether the load pattern had changed (number of queries, types of queries, etc.), but didn't notice any changes. There were also no changes in the database's data structure or indexes. The only things that had changed were the upgrade to version 4.4.5 and the move to ARM one day before the incident.

The first mitigation step we took was upgrading the instance type from r6gd.12xlarge to r6gd.16xlarge, but that didn’t help either. After that we spun up a new node on x86_64 with the same MongoDB version and temporarily made it the primary in the replica set, and that helped!

Looks like the performance issue was caused precisely by the combination of a specific version (4.4.5) and the Graviton2 processor. You can see it in the graph below, where the green line is the ARM database and the yellow one is its x86_64 buddy. The red vertical line marks the time when we switched the load from one primary to the other.

[Graph: CPU utilization of the ARM primary (green) vs. the x86_64 primary (yellow)]

This seemed like the right moment to actually get in contact with the Mongo team, so we created a ticket in Mongo's Jira. All thoughts and ideas were gathered in https://jira.mongodb.org/browse/SERVER-56237

After creation, the Mongo team asked us to collect some additional diagnostic.data and logs. 

We decided to scrub some sensitive data from our logs first. There is nothing much to describe there, just awk/cut/jq/grep black magic.
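As a toy illustration of that black magic (the log line and field names below are invented for illustration; real mongod log lines are much richer JSON), here is a sed one-liner that redacts anything shaped like an email address before the logs get shared:

```shell
# Invented sample mongod-style log line; real ones contain far more fields.
line='{"t":"2021-04-20","msg":"Slow query","attr":{"ns":"appdb.users","email":"user@example.com"}}'

# Redact anything that looks like an email address before sharing the logs.
scrubbed=$(printf '%s\n' "$line" | sed -E 's/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+/<redacted>/g')
echo "$scrubbed"
```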

After a week, we got an answer.

[Screenshot: reply from the MongoDB team]

Essentially: “Yes, the cause is clear; please wait for the GCC 8.5 release, which will support the required build flags. Our mongo package is already prepared for it.”

4.4.7

Soon enough, GCC 8.5 was released at the end of May 2021, and after a slight delay of about two months (GCC 8.5 was released on May 17, Mongo 4.4.7 on July 17), we had a brand new release to test.

Right after that, our team upgraded the instance that had had the performance issue (which, during the investigation, had been migrated back to x86_64). Luckily, all performance problems were gone after migrating back to ARM and upgrading to 4.4.7!

[Graph: CPU utilization after migrating back to ARM on 4.4.7]

The red vertical bar on the graph marks the migration time. If you have a sharp eye, you probably noticed that CPU utilization increased, but that's mainly because we also downscaled the instance from r5d.16xlarge to r6gd.12xlarge (64 CPUs to 48 CPUs), so the increase is entirely understandable.

We migrated all our 4.4.x instances to 4.4.7 and decided to leave it untouched for a few weeks/months, just to be extra sure that the current version is stable.

But on August 18 we got a message.

4.4.8

In our engineering-mongo channel in Slack, our CTO sent the message:

[Screenshot: Slack message about the duplicate-documents bug fixed in 4.4.8]

That does indeed look very serious, because Mongo could potentially behave like this:

- [Developer]: “Hey! There are `d` and `c` fields in the document; please make sure they are unique!”

- [Mongo]: “Yeah, for sure, bro, you can rely on me!”

- ….

- [Incoming insert request] {d: ‘Foo’, c: ‘Bar’}

- [Mongo]: Looks correct. Saved.

- [Incoming insert request] {d: ‘Foo’, c: ‘Bar’}

- [Mongo]: ???? Still looks correct. Alright, saved

So, yes, as bad as it sounds: on version 4.4.7, Mongo cannot fully guarantee the uniqueness of documents when it is required.

With all that said, we hoped the upgrade would not break anything, and we already had a good set of automation tooling for bulk upgrades, so we pushed the “Big Red Button” and upgraded everything to 4.4.8.

For the first few hours everything looked good, but then the alerts started:

Segmentation Fault

Luckily, since version 4.4.4 we had had auto-restart on our instances, so we only saw interrupted connections from time to time and didn’t experience significant downtime. Still, instances restarting occasionally is definitely not the desired behavior for a production database.
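The auto-restart we were relying on looks roughly like this systemd drop-in (the exact path and values here are an assumption, a sketch rather than our literal unit file):

```ini
# Assumed drop-in: /etc/systemd/system/mongod.service.d/override.conf
[Service]
Restart=on-failure
RestartSec=5
```

After adding a drop-in like this, `systemctl daemon-reload` makes it take effect, and systemd brings mongod back up whenever the process dies with a non-zero exit (such as a segfault).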

Urgently, we downgraded most of our instances back to 4.4.7, migrated the few instances that had performance issues on 4.4.7 back to 4.4.6, and filed another mongo ticket:

https://jira.mongodb.org/browse/SERVER-59448

[Screenshot: the SERVER-59448 Jira ticket]

4.4.9 - 4.4.10

Two months after filing the ticket, we got an answer: version 4.4.9 had been released, please upgrade, it was likely to fix our issue as well. Mongo also has a command called validate(), which you can run to be extra sure you don’t have any corrupted data.

[Screenshot: reply from the MongoDB team recommending 4.4.9 and validate()]

That sounds pretty good, but the description of validate() contains these lines:

[Screenshot: excerpt from the validate() documentation]

That tells us the validate command:

  1. Is slow

  2. Obtains an exclusive write lock

These two points make it impossible to run validate() on a primary.

We tried to run validate on a secondary. It stopped the replication process (understandably so) and started validating… and after 19 hours it had validated 5 percent of the data in one of our databases.

We found that completely unacceptable and decided to run the check on a secondary again, but this time on an NVMe-based instance of the same size as the primary. Or, even better, make that node the new primary and run validate on the old primary!

After those manipulations, we ran the validate script on the previous primary (and yes, it ran much faster, finishing in a few hours). We got the results, and it looked like no issues were found. That must mean we could safely upgrade our primary to 4.4.10 (yes, while we were playing around with the validate script, the Mongo team released one more patch version).

And it failed with a segfault again…

I was ready to rush into the comments of our mongo ticket and express my emotions there, but our CTO suggested something different.

“Let's try to re-sync the primary from scratch (not with rsync, but with Mongo's own replication), and do it on a 4.4.10 primary instance,” he said.

Sounded like a bright idea. We ran the migration with replication and, wow, the server works on 4.4.10 and hasn't failed even after a week!

We kept it that way for a few more weeks to check release stability and to find time to automate the full re-sync, because it had suddenly become slightly more complicated than before.

Final Round

So, only one last step separated us from the point where all databases are on ARM and 4.4.12: the primary migration!

First of all, I decided to automate everything completely and spawned a dedicated instance just for this, which runs the migration processes in separate tmux sessions so they don't get interrupted by arbitrary connection issues.
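In dry-run form, the fan-out looked roughly like this (the database names and the migrate.sh script are placeholders; the real runner executed the tmux commands instead of printing them):

```shell
# Placeholder database list; the real one came from our inventory.
DATABASES="appdb statsdb eventsdb"

for db in $DATABASES; do
  # Dry run: print the tmux command instead of executing it. The real runner
  # started each migration in its own detached session, so a dropped SSH
  # connection to the automation host would not interrupt the migration.
  echo "tmux new-session -d -s migrate-$db './migrate.sh $db'"
done
```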

I also slightly modified the migration script we already had; the final version looks like this:

  1. Provision new primary with suffix -new (a.k.a Primary-new)

  2. Stop mongod on Primary-new and Secondary

  3. Rsync data from Secondary to Primary-new

  4. Start mongod on Secondary and let it eat up the replication lag

  5. Start mongod on Primary-new

  6. Add Primary-new to the ReplicaSet as a secondary and let it catch up on the replication lag

  7. Stop mongod on Primary-new

  8. Remove Primary-new from the ReplicaSet

  9. (if the database is important and many background workers use it) Pause all Sidekiq queues and wait five minutes

  10. stepDown existing primary (force it to be a secondary for 60 seconds)

  11. Stop mongod on the existing primary

  12. Switch ENI from old primary to the new one

  13. Start mongod on Primary-new

  14. Make sure that mongod on Primary-new is actually Primary and works well

  15. Unpause sidekiq background queues if they were paused

  16. Rename Old and New primaries

With this automation and a dedicated host to run it from, one DevOps engineer can comfortably migrate around five databases at the same time!

In our case, we migrated all our databases within one week without any incidents!

Arbiter

A few days after the migration, we noticed that we were still running with FeatureCompatibilityVersion 4.2 even though all of our databases were actually on 4.4, so it seemed like a good time to migrate the FCV to 4.4!

It looks pretty simple: just open a mongo console on the primary and run

```

db.adminCommand( { setFeatureCompatibilityVersion: "4.4" } )

```

We decided to go ahead and test it on one of our primaries, chose a less important one, and upgraded the FCV there.

For the next 10 minutes everything looked good, but then we got a notification that our arbiter could not connect to the primary. We logged in to the arbiter and realized it was still on 4.2!

So we decided to upgrade the mongo package on the arbiter and use 4.4 there.

Let me remind you once again: one arbiter instance serves all our MongoDB ReplicaSets (we simply run many mongod processes there on different ports, one per ReplicaSet), so the migration process looked pretty straightforward:

  1. Stop all mongod processes

  2. Run yum upgrade

  3. Start all mongod processes

Sounds easy, yeah?

We followed that script and everything was fine. I stepped away from my laptop to get coffee, and imagine my surprise when I came back: the arbiter processes had started to die one by one!

Moreover, when I logged in to the server, it showed me something like

```

-bash: fork: retry: No child processes

```

for any input!

We checked the number of established network connections: it was over 80,000! That overloaded the system and prevented us from performing even the simplest operations on the machine (like executing a basic console command). We restarted the machine to stop all mongod processes and, right after startup, downgraded mongo back to 4.2.
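A quick way to get that established-connection count from the shell is to read /proc directly (a sketch; on the real boxes we also broke the count down per port, which is not shown here):

```shell
# Count ESTABLISHED TCP connections by reading /proc directly; state "01" in
# /proc/net/tcp means ESTABLISHED. Roughly equivalent to:
#   netstat -an | grep ESTABLISHED | wc -l
count_established() {
  cat /proc/net/tcp /proc/net/tcp6 2>/dev/null | awk '$4 == "01"' | wc -l
}
count_established
```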

When all the mongod processes started, the number of connections was around 20,000 to 30,000, which is still a lot, but the server could handle it.

Why did the number of connections increase after the mongo upgrade? We still don’t know.

To fix the root cause, we decided to make the arbiter “hidden” in ReplicaSet. What does that mean?

Currently, when a client connects to a replica set, it can connect to any member and ask, “Where can I read/write data?” The answer is the same from all members: “On the primary,” and the client is rerouted there.

That means our arbiter also accepts client connections and just reroutes them to the primary.

But if a member of the ReplicaSet is marked as “hidden”, it is not shown during the initial discovery after a client connects, and clients cannot connect to it, which reduces the load on the arbiter.

So we marked the arbiter as hidden in every RS and, to be extra sure, restricted access to the arbiter's mongo from client applications.

Conclusion

Finally, we can mark this issue done. All primaries, secondaries, and arbiters are on 4.4.12 and Graviton processors, and they have been working well for more than two weeks already!

Why did the Mongo migration to Graviton take so long?

Actually, the main reason is that we tried to use cutting-edge versions of mongo right after release. These versions had bugs, but what OSS software doesn't? It's always a tradeoff: use the freshest OSS and be its beta tester, or use slightly outdated but more stable releases.

We chose the first option, and it was a long but interesting journey that improved not only our infrastructure but also our knowledge about Mongo, ARM, and many other things around!

But yes, I think if we started the upgrade now, on mongo 4.4.12, it would take no more than one month total. But who knows?

What about achievements?

The first and most important achievement is in the table below:

[Table: monthly instance costs before and after the Graviton migration]

$7,700 in cost savings every month without any cons! (Except the migration process, of course.)

As for the less obvious achievements, we significantly improved our automation tooling around MongoDB. Now we can:

  • Upgrade/downgrade patch versions on all our mongos within a day

  • Upgrade and resync all secondaries from scratch within 1.5 weeks, with around 1 to 2 days of one engineer's time

  • Upgrade and rsync all primaries within a week, spending around 30 minutes of engineer time per database

Our team also understands much better than before how to handle mongo issues: what might go wrong, how to deal with it, how to collect logs and telemetry, and whom to ask.

We also slightly improved our arbiter installation by cutting off unnecessary client connections; it can now handle the same load on a server half the size.

And, of course, I hope we helped Mongo become better and saved some time for the rest of the community by finding issues on our own and testing early releases.

That was our story about ARM. Maybe it's time to start your own?





Written by: Ivan Takarlikov, Infrastructure Engineer

Date: May 2022
