Distributed build service for monorepos

In the world of version control systems, a monorepo — short for single repository — puts the code for multiple software development projects in one place to facilitate code sharing, tooling, and versioning. It offers several advantages over the alternative strategy — microrepos, or separate repositories for each part of a system — which is why DoorDash is transitioning from microrepos to a monorepo structure.

But despite simplifying dependency management, supporting atomic changes across projects, and making refactoring easier — all while promoting code reuse and consistency in the build process — monorepos come with challenges, too. Scalability is an issue, as are longer build times and more complex access controls. In this post, we will explore how our in-house build service manages the challenges while ensuring fast and reliable monorepo changes while serving as the backbone of our continuous integration, or CI, pipeline.

Challenges

At the time this system was built, DoorDash primarily used Kotlin on the backend with more than 200 microrepos. Some of these repositories were larger, had a Gradle multi-project build, and included integration tests, making CI times less than snappy. Additionally, it was onerous to load a larger repo into an integrated development environment. Ultimately, these challenges motivated DoorDash’s shift toward the monorepo paradigm for our Kotlin services. As the number of projects in the monorepo grew, build times grew alongside them, and it became apparent that we needed a scalable and efficient build service. Additional background information can be found in our Developer Productivity Engineering Summit 2023 talk.

At DoorDash, the standard protocol for microservice communication is Google Remote Procedure Call, or gRPC, with protocol buffers, or protobuf, representing service interfaces and data schemas. Our protobuf monorepo was established about five years ago when we initially embarked on the monolith to microservices endeavor. There was a strong need for testability in the pipeline and parallelization on both the package and target-language levels.

In the following sections, we will address the specific challenges in both these problem types and how we extended the same approach to solve them.

Choosing build tools

Currently, DoorDash’s Kotlin monorepo includes more than 330 Gradle projects, while the protobuf monorepo contains more than 500 domains. Many projects come with a complex build graph and increased build volume when changes are made. To ensure a smooth development experience, it's crucial to minimize overall build times while maintaining accuracy. Key requirements to accomplish this include build avoidance, minimizing rebuilds, parallelization, scalability, fault tolerance, and clear failure reports.

While several build tools can handle large-scale repos, we focused on these three:

Bazel: A fast, scalable, multi-language build system that excels at parallel builds and minimizing rebuilds. Because of its steep learning curve and the significant effort required to migrate from Gradle, especially for customized Gradle configurations, we chose at this stage not to use Bazel. We are, however, adopting it now as the default build system for non-Kotlin languages.

Buck: Like Bazel, Buck offers high performance; Buck2 shows even better results. However, migrating to Buck would also require fundamental changes and slow development during the transition.

Gradle Enterprise: This tool provides valuable insights into build performance, improved caching, distributed testing, and predictive test selection. Although it works with our existing build pipelines, it would require customization for distributed builds.

Our engineering org was not willing to take on the cost of transition to either Bazel or Buck for a variety of reasons, including:

The learning curve would slow developers down
Compared to Gradle, the Kotlin development experience was less than ideal.
It would be difficult to replicate Gradle’s robust ecosystem of plugins.
It also would be tough to replicate our internal Gradle plugins, which reduce build boilerplate, optimize Maven repository settings, enforce internal code quality rules, and apply standard application settings.

Given these factors, we decided to minimize our migration effort and extend open-source Gradle by developing an in-house build service that meets all requirements for efficient monorepo builds while preserving the current developer experience. This service handles full monorepo builds in under 25 minutes, with a P95 build time of under 10 minutes.

Distributed build service architecture

The distributed build service uses the following key terminologies:

Build project: An individual source code project to be built and tested, which could be a single Gradle project or a protobuf package.
Build event: A collection of build projects that are built together. A user’s change to the monorepo initiates a build event, which may encompass multiple build projects.

As shown in Figure 1, there are three main components in the service:

Web: The build service entry point, responsible for initiating build work and checking status.
Controller: Manages coordination and distribution of build projects, publishing them to the workers.
Worker: Nodes responsible for executing the actual building and testing processes.

*Figure 1: Distributed build service architecture*

Changed targets calculation

Before starting a build, we first calculate the changed targets to minimize the scope of what needs to be built when changes are made. This step also determines the build sequence if multiple projects are affected.

To do this, we analyze the dependency relationships across all projects in the monorepo and create a reverse dependency graph, as shown in Figure 2. Starting from the directly changed projects, we traverse the graph to identify all impacted projects. Building and testing only these affected projects ensures the changes are verified.

*Figure 2: Example dependency graph: When lib3 is changed, the minimal set of projects that need to be built and tested are lib3, lib2, ser2, and ser1.*

Web application

As the entry point for the build service, the web application is responsible for:

Receiving build event requests from the CI system.
Analyzing the build projects within the event and initializing build statuses.
Sending the build event to a queue to be processed by the controller.
Checking the build event status and reporting back to the CI system.
Sending termination signals for build events if necessary.

The web application is lightweight because it does not handle the actual building process.

Controller

The controller coordinates distributed builds and manages the distribution of build tasks to workers. Key responsibilities include:

Categorizing projects based on their types.
Controlling throughput by managing how many projects are built in parallel.
Sending build tasks to the project queue to be picked up by workers.
Monitoring the status of each project and updating the overall build event status.
Retrying failed builds if necessary.

The controller is also lightweight and does not communicate directly with workers; instead, it monitors status by querying the build status storage.

Worker

Workers handle the actual build processes. There are several worker clusters, each comprising multiple worker nodes. Nodes within the same cluster use identical Docker containers and are deployed together. Build projects are evenly distributed across worker nodes, allowing them to be built in parallel.

By creating isolated workspaces within the container, a worker node can ensure that each project operates in its own environment without interference from others.

The build process inside a worker follows a sequence of activities, as shown in Figure 4.

*Figure 4: build process will trigger a list of activities in sequence*

The build configuration can be customized to define which activities to include in the process and whether their results are part of the critical path. This provides flexibility to specify which activities are mandatory for success, while others can be marked as optional, allowing their results to be disregarded even if they fail.

Dynamic configuration

Both the controller and worker have access to a dynamic configuration component, allowing users to make customizations at any time, including such things as setting the number of parallel jobs, selecting the architecture for building projects, and enabling test sharding. Importantly, these changes can be applied without redeploying the distributed build service, which offers flexibility and ease of adjustment.

Build status storage

The web, controller, and worker components are decoupled and communicate through a message queue, which means the controller doesn’t track which worker node is handling which build project. Instead, build status is shared via a centralized storage system. Workers update the build status for each project, while the web and controller components query this storage to monitor build outcomes. We opted to use a Redis cache for this storage because these statuses are only relevant during the build process and do not require long-term retention.

Fault tolerance

Build time for an individual project can range from 30 seconds to 30 minutes. During this process, workers may unexpectedly be terminated or suffer from a transient failure, potentially losing all in-progress builds. Because multiple builds often run in parallel on each worker node, it's critical to avoid losing progress without reporting status.

To address this, we implemented a heartbeat mechanism. The controller and worker pods send heartbeat data for running tasks every 10 seconds. The web application monitors the controller's heartbeat and the controller monitors worker heartbeats. If no heartbeat is received for one minute, the task is automatically republished to avoid disruption.

Workload distribution

Building and testing tasks are resource-intensive, consuming significant CPU and memory. While most build systems adapt to available CPU cycles for parallel tasks, managing memory usage is essential to prevent failures due to out-of-memory errors. Given that multiple build jobs run on the same machine, careful workload distribution is necessary.

To solve this, we introduced a resource token mechanism. We track each project's historical memory usage and map the total memory of a worker machine to a pool of tokens — for example, if 5 GB equals one token, then a machine with 60 GB of memory will have 12 tokens. Projects must acquire the tokens required for their memory usage before building. If there aren't enough tokens, the project is assigned to another machine.

Auto-scaling

Worker nodes are long-lasting Docker containers that persist after build jobs complete, allowing new jobs to reuse the same nodes without warm-up time. This persistence also retains local cache, improving build times for subsequent jobs. However, during low build request volumes, it wastes resources to over-provision worker machines. To prevent this, we implemented a two-dimensional auto-scaling policy:

The number of worker nodes adjusts based on U.S. working hours, with more nodes active during the week and fewer at night and on weekends.
Worker node count dynamically scales based on the project queue size and the number of jobs unable to find machines with sufficient resource tokens. This policy ensures the system can efficiently handle sudden spikes in build requests.

Building Kotlin monorepo

DoorDash’s Kotlin monorepo contains hundreds of Gradle projects encompassing various services and libraries. While these projects share a common structure, they have diverse requirements for building and testing. The system is designed to meet all these needs, enabling highly efficient parallel building and testing.

Composite build

The Kotlin monorepo uses a composite build structure made up of individual, independent Gradle projects. By including dependencies in settings.gradle, one project can depend on other projects at the source level. Scanning all settings.gradle files creates a reverse dependency graph that is essential for calculating changed targets. The independence of each project supports the distribution of builds across multiple machines.

Isolated Gradle daemons

Project workspaces and Gradle daemons are isolated to prevent race conditions when building multiple projects concurrently on a single worker machine. Each worker machine is prepared with multiple GRADLE_USER_HOME directories to match the number of parallel builds allowed. Each build job runs with its own Gradle daemon and local cache. To optimize initial builds, common preheated build caches are copied into these directories during setup.

Variety of build architectures

DoorDash services run on both x86 and ARM architectures which require projects to be built and tested on different types of machines. By using a decoupled architecture, different project queues can be directed to suitable worker nodes as shown in Figure 5.

*Figure 5: How we manage different types of architectures*

The controller routes projects to the appropriate queues based on their type. Kafka is used as the messaging system, with each project queue represented by a different Kafka topic.

At DoorDash, we support DinDs, or Docker in Docker, projects, which involve launching additional containers such as CRDB or Kafka for unit testing. These containers must run in privileged mode to access the host machine’s Docker daemon.

One challenge of running multiple DinD projects in parallel on the same machine is the potential to create port conflicts, in which containers attempt to connect using the same port. To avoid this, we mandate that projects use test containers configured to select random ports during functional testing, as detailed in this earlier article.

Test sharding

Test sharding splits a project's unit tests into smaller groups, or shards, that run in parallel, significantly reducing overall test time. Users can configure projects for test sharding and define sharding strategies through the runtime configuration component. Worker nodes apply the strategy, split the tests, and send each shard to the project queue for parallel execution by other worker nodes.

Building protobuf monorepo

The DoorDash protobuf monorepo differs from the Kotlin monorepo in a few key ways:

All domains must be published as artifacts for consumption outside of the protobuf monorepo.
Dependencies on other protobufs within the monorepo are referenced by version using the artifactory, rather than composite build.
Artifacts must be produced for different language targets as required by each domain.

These differences mean we must use a slightly different build strategy that is customized to the workflows required by the protobuf monorepo.

Different languages

Protobufs allow microservices that are written in different languages to communicate via the gRPC protocol. These microservices need the protobufs to be made available to them within their respective build systems, which means the protobuf monorepo must publish artifacts for each language the domain specifies.

Each domain specifies within its configuration file all the languages to which it must be compiled. The build service parses the configuration files for each domain within the changed targets, then splits the domain into separate jobs for each language. The build process for each language is independent from that of any other language, so we distribute each domain and language pair as an independent job across workers, as shown in Figure 6.

Each language also requires its own compilation logic. Because workers operate on only one language within a job, they can easily determine which compilation logic to invoke.

Publishing artifacts

The protobufs within the monorepo are used in a variety of other repositories across DoorDash. The protobuf source files are not local to these other repositories, so we publish them to an internal artifactory account where they can be referenced as dependencies.

The protobuf dependencies are useful to developers both during development and iteration on the protobuf schema and as a production-ready dependency once the schema is solidified. To support this, the build service receives a flag determining whether the current build should be performed in development (CI) or production (CD) mode.

When in CI mode, the build service automatically publishes snapshot dependencies with the developer’s changes. The snapshots are published to a separate artifactory, and the build service modifies the protobuf’s semantic versioning to match the language’s specification for snapshot releases.

When in CD mode, the build service publishes compiled protobuf artifacts to the production artifactory. Build steps that were already verified in CI, such as linting and formatting, are skipped because they are unnecessary.

Auto-formatting and updating pull requests

The protobuf monorepo sometimes requires that the build service push changes back to a pull request after building it. Specifically, the protobuf monorepo contains OpenAPI documentation, Golang files, and gRPC Gateway files which are generated automatically. We also take advantage of the build service’s write capability to automatically format code for developers, rather than just failing the CI when formatting issues are detected.

Generated code is produced by a variety of different steps within the build service. These steps are distributed across different workers running in parallel, meaning we can’t commit the changes as soon as they’re generated without running into race conditions.

To prevent race conditions, each domain and language combination that produces code is exported to a git patch file. The patch file is tagged with the build’s unique identifier and the current domain and language combination before being uploaded to a short-lived S3 bucket. After all build steps have been completed, we run a specialized “Update PR” step, as shown in Figure 7. This step lists all of the patch files from the S3 bucket and applies them to the branch as a single commit, which it then pushes before marking the build successful. Each language’s generated code maps directly to a unique output directory, safely applying patch files without merge conflicts.

Key results

The distributed build service has allowed builds at DoorDash to scale efficiently as monorepo adoption increases. Among the key results are:

65% faster Kotlin builds while building 290% more projects compared to the legacy build system; we went from 85 projects in 70 minutes to 330 projects in 25 minutes.
The ability to handle more than 40 concurrent build requests, with more than 400 distinct builds during peak hours
75% reduction in average build time; while the legacy system required 16 minutes on average, the Protobuf Monorepo requires only four minutes

Conclusion

We have implemented build services both for our Kotlin and protobuf monorepos. These systems utilize a distributed architecture that can easily scale up or down as the number of active builds changes over time. This has resulted in faster and more reliable builds in the CI and CD pipelines, which in turn improves developer velocity across all of DoorDash. We’re excited about the flexibility that owning our build service provides us and we look forward to continually iterating the service as DoorDash’s build requirements evolve.