
Engineering · Andrey Zhelnin · December 2022

Dr. Jenkins or How I Learned to Stop Worrying and Love the Autoscaling

Here at Sensor Tower, we use the Jenkins CI/CD tool to automate our build pipelines running on AWS.

Always striving for excellence, we periodically review our infrastructure to improve cost efficiency and speed up our services. Let’s take a look at the exercise we carried out for the Jenkins worker nodes.

Our CI/CD Pipeline

We follow a continuous deployment workflow: each time somebody merges a PR, a build is triggered that ends with publishing a new version of the app, which is then deployed by other, automatically triggered jobs. Additionally, for PR commits, a build is triggered each time somebody pushes a commit to their PR branch.

Our build pipeline consists of the following steps:

  • Fetch the repo

  • Install dependencies

  • Run linters

  • Precompile front-end assets

  • Build the package to be deployed

  • Run unit and integration tests

  • Deploy the package to a staging server

  • Run E2E tests

If all of the steps above pass, we mark the build as successful; otherwise, it’s a failure.

Once an approved and green PR is merged into the master branch, its builds and tests are run against the staging infrastructure, and only then is it rolled out to the production servers.

Two of the steps above, “unit and integration tests” and “E2E tests”, consume most of the processing power. Since they take too long to finish sequentially, we decided to parallelize them to reduce build time, but that comes at the cost of needing a high-CPU instance. See how we used Cypress to run our E2E tests.

Preconditions

To achieve the best build time for our product, 7 constantly running large EC2 instances were used. This setup allowed several parallel PR builds without a huge queue.

The approach had several disadvantages:

  • Each instance was static, provisioned via Ansible. Re-provisioning a new Jenkins agent instance takes quite a long time if we need to scale up. It also requires periodic maintenance, e.g. log cleanup.

  • The pool was locked to the number of statically provisioned instances, with no way to flexibly change its size.

  • Running the nodes 24/7 is wasteful: although we are a global company, usage varies across the day, and we never work weekends at Sensor Tower.

Solution

Migrate to an AWS Auto Scaling group (ASG). With the new solution we pay only for the instances’ real usage time, which also gave us room to increase the instance size and improve build times.

Technologies Used

  • Infrastructure provisioning: HashiCorp Terraform

  • AMI build workflow: HashiCorp Packer + Ansible

  • Lambdas: Ruby

How To Pack Everything To Make It Work

  1. Establish a baseline capacity in the ASG to remove any wait time for the PR build to start.

  2. When a new PR build job starts, it triggers an ASG scale-up for one extra instance to provision a new Jenkins agent. The purpose of this strategy is to always have N+1 available agents to pick up new jobs.

  3. The PR build job registers the agent’s instance ID as in use in a DynamoDB state table with a TTL, to prevent instances that are running PR builds from being terminated by the ASG.

  4. When the PR build job is done, the agent’s instance ID is deleted from the state table and a Jenkins job is triggered to re-scale the ASG (desired capacity), which finally triggers the autoscaling termination Lambda attached to the ASG.

  5. When the Lambda function gets the list of instances for termination, the instance IDs are verified against those in the state table with a non-expired TTL, so that protected instances are excluded from termination (a sketch of this check follows the list).
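
To make steps 3 and 5 more concrete, below is a minimal Ruby sketch of the state-table registration and the custom termination-policy Lambda. The table name, attribute names and TTL value are hypothetical, and the event/response shapes follow AWS’s documentation for custom termination policies; our actual implementation may differ in its details.

```ruby
require 'json'
require 'aws-sdk-dynamodb'

DYNAMODB    = Aws::DynamoDB::Client.new
STATE_TABLE = 'jenkins-agent-state' # hypothetical table name

# Step 3: called from the PR build job to mark the agent's instance as in use.
# The TTL attribute lets DynamoDB expire stale records on its own.
def register_agent(instance_id, ttl_seconds: 3600)
  DYNAMODB.put_item(
    table_name: STATE_TABLE,
    item: {
      'instance_id' => instance_id,
      'expires_at'  => Time.now.to_i + ttl_seconds # TTL attribute (epoch seconds)
    }
  )
end

# True if the instance is still registered and its TTL has not expired yet.
def protected?(instance_id)
  item = DYNAMODB.get_item(
    table_name: STATE_TABLE,
    key: { 'instance_id' => instance_id }
  ).item
  !item.nil? && item['expires_at'].to_i > Time.now.to_i
end

# Step 5: custom termination policy Lambda. The ASG passes the candidate
# instances; we return only the IDs that are safe to terminate.
def handler(event:, context:)
  candidates = event['Instances'].map { |i| i['InstanceId'] }
  { 'InstanceIDs' => candidates.reject { |id| protected?(id) } }
end
```

Records removed by the build job in step 4 (or expired by DynamoDB’s TTL) make the instance eligible for termination on the next scale-in.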

Grafana: Jenkins agents allocation diagram for 24 hours

“Houston, we’ve had a problem” (or the issues we faced)

Resource Cleanup

Problem: In phase one of the project we used a Jenkins cron job that scans all registered Jenkins agents and deregisters those that are not in a running state. While evaluating this solution, we found that some jobs were picked up by Jenkins agents that were about to be terminated. This happens because of the lag between the ASG’s instance termination decision and the actual termination.

Due to an AWS limitation, the ASG termination policy Lambda is limited to 2 seconds of execution time, so it wasn’t possible to clean up terminated agents from the Jenkins controller directly from that Lambda.

Solution: To resolve this, a second Lambda is invoked asynchronously right after the termination decision; it performs the deregistration in the background without blocking the original Lambda function.
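
Since the termination-policy Lambda has to stay within its 2-second budget, the deregistration work is handed off with a fire-and-forget invocation. A minimal Ruby sketch, assuming a hypothetical cleanup function name:

```ruby
require 'json'
require 'aws-sdk-lambda'

LAMBDA = Aws::Lambda::Client.new

# Hand the slow Jenkins-controller cleanup off to a second Lambda.
# invocation_type: 'Event' makes the call asynchronous, so the
# termination-policy Lambda can return immediately.
def schedule_agent_cleanup(instance_ids)
  LAMBDA.invoke(
    function_name: 'jenkins-agent-cleanup', # hypothetical function name
    invocation_type: 'Event',
    payload: JSON.generate(instance_ids: instance_ids)
  )
end
```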

Warm Up Time

Problem: Since the original solution was based on statically provisioned Jenkins agents, they had been using an on-disk cache for the dependency installation tasks (e.g. RPMs, Ruby gems, Yarn cache). With the autoscaling solution there is no guarantee those cached folders will be present before the build job, which extends job execution times.

Solution: We decided to store the cached objects in an S3 bucket and sync them as part of the cloud-init script during the instance creation phase.
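
In our case the sync itself runs from the instance’s cloud-init script, but the idea can be sketched in Ruby as well; the bucket name, prefix and target directory below are hypothetical:

```ruby
require 'fileutils'
require 'aws-sdk-s3'

S3 = Aws::S3::Client.new

# Pull the pre-built dependency caches (RPMs, Ruby gems, Yarn cache) from S3
# into the directories the build jobs expect. Pagination is omitted for brevity.
def warm_up_cache(bucket: 'jenkins-build-cache', prefix: 'cache/', dest: '/var/lib/jenkins')
  S3.list_objects_v2(bucket: bucket, prefix: prefix).contents.each do |object|
    next if object.key.end_with?('/') # skip "folder" placeholder keys
    target = File.join(dest, object.key.delete_prefix(prefix))
    FileUtils.mkdir_p(File.dirname(target))
    S3.get_object(bucket: bucket, key: object.key, response_target: target)
  end
end
```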

Summary

Migrating from statically provisioned Jenkins agents to an ASG brings flexibility in fleet size, dramatically improves resource utilization, and reduces job execution time, with low overhead costs.

While the numbers are great, it’s really important to note that the most valuable outcome is a better quality of life for our colleagues, whose builds are picked up and finished faster. This project required multiple engineers from different teams to collaborate, and it was great for knowledge sharing, learning and personal growth. A special shout-out to Muhammad Seyravan, who did a big part of the work and helped drive the project from start to end. By the way, don’t miss Muhammad’s blog about Cypress!




Written by: Andrey Zhelnin, DevOps Engineer

Date: December 2022