Docker has fundamentally changed how developers build and deploy applications, with most companies leveraging containers in some capacity. At Blacksmith, we regularly see our customers' Docker builds taking 30 minutes or more, which can significantly hinder developer productivity and delay the deployment of hotfixes.
In this post, we'll give you the exact steps needed to set up a remote BuildKit instance on AWS. The 30 minutes it takes to follow this blog can dramatically speed up Docker builds for your entire org. But before diving deeper, let's review some Docker fundamentals. There are three primary levers one can pull to optimize Docker build times:

1. Running builds on a more powerful machine.
2. Maintaining a persistent build cache that survives across builds.
3. Structuring your Dockerfile so that rarely-changing layers come first and stay cached.
BuildKit is a modern backend that replaces the legacy Docker builder, offering improved performance and new features.
The most relevant feature for our use case, though, is BuildKit's ability to execute builds on a remote instance. This lets us offload the build process from the local machine to a more powerful remote server.
BuildKit achieves this by using a client-server architecture. The BuildKit client, which runs on your local machine or CI/CD runner, communicates with the remote BuildKit daemon over a secure connection. When you initiate a build, the client sends the build context (Dockerfile, source code, etc.) to the remote daemon, which executes the build and streams logs back to the client.
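You can see this in action from any machine with Docker installed. Here's a quick sketch, assuming a remote daemon is already listening (the endpoint address is a placeholder):

```bash
# Register a buildx builder that targets a remote BuildKit daemon
# (tcp://203.0.113.10:9999 is a placeholder endpoint)
docker buildx create --name remote-builder --driver remote tcp://203.0.113.10:9999

# Builds now execute on the remote daemon; logs stream back locally
docker buildx build --builder remote-builder -t my-app:latest .
```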
Let's circle back to the first two levers we mentioned for accelerating Docker builds: using a powerful machine and having a persistent build cache. BuildKit allows you to run a remote Docker builder instance on any cloud provider, and we'll walk through an example of setting this up on AWS.
By running BuildKit on AWS, we can:

- Run builds on a much beefier machine than a standard CI runner.
- Keep the build cache on the instance's disk, so it persists across builds.
- Share that cache across every CI run in the org.
We've created a Terraform configuration file in this repository https://github.com/useblacksmith/remote-buildkit-terraform to automate the provisioning and configuration of the necessary resources for running our remote BuildKit instance on AWS. At a high level, here's what it does:
- Provisions a `c5a.4xlarge` EC2 instance to host the BuildKit daemon. This instance type provides 16 vCPUs and 32 GiB of memory on AMD EPYC processors, giving builds plenty of parallelism.
- Creates an IAM role (`GithubActionsBuildKitRole`) that your GitHub Actions workflows assume, scoped to the GitHub org and repo you configure.
- Deploys to `us-east-2` by default, but you can run it in whichever region you prefer. However, you should verify that the Amazon Machine Image (AMI) we're using is available in that region.

Note that our BuildKit instance is configured to listen on port `9999` of the EC2 instance. The instance's public IP address and this port are essential for configuring the GitHub Actions workflow to connect to the remote BuildKit instance.
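For reference, here's a minimal sketch of what the daemon invocation on the instance looks like (the actual setup lives in the Terraform repo):

```bash
# Run the BuildKit daemon listening for TCP connections on port 9999
buildkitd --addr tcp://0.0.0.0:9999
```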
To start, follow these steps:

1. Update `terraform.tfvars` to point to where you're running your GitHub Actions:
   - `github_org`: your GitHub organization
   - `github_repo`: your repository name
2. Run `terraform init`
3. Run `terraform plan`
4. Run `terraform apply`
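For example, a minimal `terraform.tfvars` might look like this (the org and repo names below are placeholders):

```hcl
# terraform.tfvars — placeholder values, substitute your own
github_org  = "acme-corp"
github_repo = "example-service"
```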
Once the apply completes, you'll see the following output; take note of it:

```
buildkit_instance_public_ip = <IP-name>
github_actions_role_arn = "arn:aws:iam::<ACCOUNT-ID>:role/GithubActionsBuildKitRole"
```
Once the remote BuildKit instance is up and running, it's time to modify your secrets and the GitHub Actions workflow files that run Docker builds. Add the following repository secrets:

- `AWS_ACCOUNT_ID`: your AWS account ID (it appears in the `github_actions_role_arn` Terraform output).
- `BUILDKIT_HOST`: the public IP address of the provisioned EC2 instance.

Use the `BUILDKIT_HOST` secret along with port 9999 when specifying the remote BuildKit server endpoint (e.g., `tcp://${{ secrets.BUILDKIT_HOST }}:9999`).
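If you use the GitHub CLI, you can set these secrets from your terminal (the values below are placeholders):

```bash
# Set the repository secrets used by the workflow (placeholder values)
gh secret set AWS_ACCOUNT_ID --body "123456789012"
gh secret set BUILDKIT_HOST --body "203.0.113.10"
```

With the secrets in place, here's an example workflow: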
```yaml
steps:
  - uses: actions/checkout@v3
  - uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::${{ secrets.AWS_ACCOUNT_ID }}:role/GithubActionsBuildKitRole
      aws-region: us-east-2
  - uses: docker/setup-buildx-action@v2
    with:
      driver: remote
      endpoint: tcp://${{ secrets.BUILDKIT_HOST }}:9999
  - uses: docker/build-push-action@v2
    with:
      context: .
      file: ./Dockerfile
      push: false
      tags: test-image:latest
      load: true
```
When we triggered a build, our first uncached run took 6 minutes 22 seconds.
When we reran the job, the cached run took only 1 minute 34 seconds.
As you can see from the logs, each layer had a cache hit, significantly improving the build time.
Caching Docker layers significantly improves build times; in this example, it cut the build from 6 minutes 22 seconds to just 1 minute 34 seconds. Docker layer caching is particularly effective when your "base" layers change infrequently, since Docker rebuilds only the layers from the first modified one onward and reuses the cached layers above it. Because this cache is shared across your entire org, all CI builds benefit from it.
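To make the most of that cache, order your Dockerfile so the slow, rarely-changing steps come first. A minimal sketch, assuming a Node.js app (adapt to your stack):

```dockerfile
# Dependency layers change rarely, so they stay cached on the
# remote BuildKit instance; only the source COPY busts the cache.
FROM node:20-slim
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci          # re-runs only when the lockfile changes
COPY . .
RUN npm run build   # re-runs whenever source files change
```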
Although using a single shared BuildKit instance hosted on an EC2 machine is a simple approach, we do want to call out some drawbacks that limit its scalability and effectiveness for larger engineering teams: the instance runs (and bills) around the clock even when no builds are happening, and concurrent builds from a large team contend for the same CPU, memory, and disk.
You could explore dynamically suspending and resuming EC2 instances, hot-loading EBS volumes, or using spot instances to reduce costs. The main tradeoff is that suspending and provisioning new instances would increase CI wait times, since provisioning a new instance for each build carries a cold-start penalty. AWS has pointers on decreasing the boot-up time for instances using EBS volumes.
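As a simple example of the first idea, a scheduled job could stop the builder overnight and start it before working hours. A sketch using the AWS CLI (the instance ID is a placeholder):

```bash
# Stop the builder when idle and start it again on demand
# (replace the instance ID with your own)
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 start-instances --instance-ids i-0123456789abcdef0
```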
In conclusion, using a powerful remote BuildKit instance can significantly reduce Docker build times for small to medium-sized teams. While there are scalability concerns, we’ve seen many teams get very far with this solution as it offers a simple yet effective way to improve build performance, allowing you to focus on what matters.