Three steps are paramount to preventing outages in microservice applications, especially those that depend on cloud services: identify the potential causes of system failure, prepare for them, and test countermeasures before failure occurs.
Because DoorDash’s complex infrastructure and applications are susceptible to breakage, we need to understand which system failures cause outages. This is ideally done by subjecting backend services to intentional simulated failures in a test environment, where the application’s behavior under failure can be observed. Only after completing this process can we design appropriate countermeasures to ensure that future failures do not become full-fledged outages that affect DoorDash’s customer experience.
In this post, we discuss both nascent and more traditional resilience engineering approaches and their application to large-scale microservice applications. We also introduce a new approach – service-level fault injection testing – that was explored on the DoorDash platform during my recent research internship.
Enter Filibuster: An automated resilience testing tool
As a PhD student at Carnegie Mellon University, I have spent two years developing an automated resilience testing tool called Filibuster in the Composable Systems Lab of the university’s Institute of Software Research. Filibuster’s design is based primarily on the study of grey literature – conference presentations, blog posts, white papers, and other material not produced by commercial publishers – to identify the resilience bugs that have caused outages and to better understand how they can be prevented in the future. Filibuster was designed with practicality in mind; the ultimate goal of the academic research is to adapt and apply its ideas in an industrial setting.
As a result of DoorDash’s growing interest in improving platform reliability, I joined DoorDash as an intern during the summer of 2021 to test Filibuster’s applicability to the DoorDash platform. My work produced positive preliminary results along those lines, while also affording me an opportunity to extend Filibuster’s core algorithms and to implement support for new programming languages and RPC frameworks. Access to DoorDash’s real-world industrial microservice application was extremely valuable both to ensure that Filibuster’s design matches how microservice applications are developed today and to influence future features that I would not have identified in a lab setting. Experiences like these are a win-win for both the research community and the host company!
Background: Microservices and their complexity
Backend services at DoorDash are written using a microservice architecture, a style that allows independent teams, each focused on one specific component of the business, to deliver features at their own pace and then independently scale those services to meet demand. As is the case at DoorDash, microservice architectures typically are adopted in order to increase developer productivity and improve feature delivery as the business scales up. Microservice architectures typically are not adopted for technical reasons alone; they take a simple, easy-to-test monolithic application design and convert it into a harder-to-test, harder-to-reason-about distributed system.
Distributed systems are notorious for their complexity. Once a monolithic design has been broken apart into its constituent services, the handling of a single end-to-end user request from a mobile application may involve tens if not hundreds of different services working in concert. This decomposition requires that developers now consider a new type of application complexity – partial failure, in which one or more of the services that a single process depends on happen to be in a failed state. More practically, it forces developers of these applications to ask questions like the following: What happens when one of the necessary services goes down or becomes unavailable? What happens if a needed service is slow? What happens if one of the required services deploys with a bug and starts returning errors when called?
Unfortunately, developers must have answers to all of these questions. Cloud providers do not provide 100% uptime guarantees because failures do occur; software engineers are not perfect and therefore they write bugs that sometimes get past manual testing processes. Even if it were possible to have perfect pre-production testing to ensure that bugs never made it into production, DoorDash relies on a number of third-party services where bugs may exist. In short: Outages happen.
Failure is not something that can be avoided entirely. It is inevitable and must be anticipated and planned for.
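To make partial failure concrete, here is a minimal sketch, in Python with the requests library, of how these failures surface at a single RPC call site; the service name, URL, and fallback behavior are hypothetical.

```python
import requests

# Hypothetical call from an order service to a downstream inventory service.
INVENTORY_URL = "http://inventory-service/availability"

def check_availability(item_id: str) -> bool:
    try:
        response = requests.get(
            INVENTORY_URL, params={"item_id": item_id}, timeout=1.0
        )
    except requests.exceptions.Timeout:
        # The dependency is slow: should the caller wait longer, retry, or degrade?
        return False
    except requests.exceptions.ConnectionError:
        # The dependency is down or unreachable: fail the request, or degrade gracefully?
        return False

    if response.status_code >= 500:
        # The dependency is up but returning errors, perhaps a newly deployed bug.
        return False

    return response.json().get("available", False)
```

Each except branch and status check corresponds to one of the questions above, and every RPC call site in every service raises the same questions.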
Chaos engineering and when it can be used in production
Over the last 10 years, chaos engineering has emerged as the primary discipline to address this reliability problem. It first acknowledges that failure occurs and then focuses on the organizational and/or technical responses needed when services inevitably fail.
Chaos engineering was pioneered by Netflix when it moved from physical servers to AWS. The first instantiation of chaos engineering was Chaos Monkey, a tool for automatically and randomly terminating EC2 instances to verify that Netflix could withstand such failures in production. It has since expanded into an entire suite of tools. For example, Chaos Gorilla simulates the failure of an entire availability zone in AWS and Chaos Kong simulates the failure of an entire AWS region. Netflix has since devoted significant time to automating fault injection; entire production clusters now can be set up to run a chaos experiment on a very small percentage of Netflix production traffic and then torn down automatically by the company’s ChAP and Monocle fault injection systems. Former Netflix employees recently formed Gremlin, a company providing chaos-engineering-as-a-service, which already has been used by major e-commerce sites.
Netflix espouses the value of doing this style of chaos engineering directly in production. Some Netflix users might not care if they have to refresh the Netflix homepage when they happen to fall into a chaos engineering experimental group and encounter a bug. But some users do care; Netflix lets them opt out of the experimental group. DoorDash, however, can’t expect a customer trying to place an order to tolerate a random app failure, particularly not during a chaos engineering experiment. Rather than simply refreshing the page as in the Netflix case, that DoorDash user likely would become frustrated and stop using the platform. Additionally, while there are loyalty programs like Dash Pass, the switching cost for consumers remains low, which is why providing a quality customer experience is essential to avoid losing business to competitors over the long haul.
As a result, chaos experimentation in production is impractical for applications and companies like DoorDash.
What about chaos engineering in staging or during development?
Alternatively, the same style of chaos experiments could be run in a staging or local development environment. This is precisely what Gremlin and developers of local-first chaos engineering tools – for example, Chaos Toolkit – recommend as a starting point. Chaos experiments in staging can be used successfully to test an organization's response to failure – for example, on-call response or validation of runbooks. But from a technical perspective, using chaos engineering at this stage highlights some of the main downsides of the whole chaos engineering approach.
First, we are not only concerned with failures that render a service either unavailable or slow. We are also concerned with understanding the entire application’s behavior when a particular service begins returning an error. The new error may be caused by the failure of a dependent service further down the chain, or the error may have occurred because a dependent service was redeployed with a bug. These types of failures typically are harder to generate through traditional chaos engineering approaches that rely solely on exhausting the memory of service instances, rendering them unavailable on the network, or simply crashing them. The failures we are concerned with occur at the application level, not the infrastructure level. If a bug is deployed in a downstream service, we wish to harden the implementation of individual services in order to make our overall application more resilient. For a fault injection approach to be maximally useful, it should also consider application-level failures.
Second, without taking a systematic approach to fault injection, it is difficult to guarantee overall application reliability; random experimentation may miss the one particular failure that will result in an outage. Most chaos engineering approaches and tools used today, such as Gremlin or LinkedOut, rely on manually specified experiment configurations. This places the onus on developers to devise and manually specify the failure scenarios they want to test, and it neither systematically nor exhaustively searches the space of possible failures. To guarantee that a service will act a particular way when a failure occurs, we need to test that service for that failure. Therefore, we believe that any fault injection approach should generate and execute the fault configurations automatically to provide these guarantees of resilience.
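To illustrate why automatic generation matters, here is a minimal sketch that enumerates a toy fault space, the cross product of RPC call sites and the faults each can exhibit, first one at a time and then in pairs; the call sites and fault types are hypothetical, and real tools like Filibuster derive them from the application itself rather than from a hand-written list.

```python
from itertools import combinations, product

# Hypothetical RPC call sites and the faults each of them might exhibit.
CALL_SITES = ["order->payment", "order->inventory", "order->notification"]
FAULTS = ["timeout", "connection_error", "http_500"]

def enumerate_fault_configurations(max_simultaneous: int = 2):
    """Yield every combination of up to `max_simultaneous` injected faults."""
    single_faults = list(product(CALL_SITES, FAULTS))
    for size in range(1, max_simultaneous + 1):
        for configuration in combinations(single_faults, size):
            yield configuration

if __name__ == "__main__":
    configurations = list(enumerate_fault_configurations())
    # 9 single faults plus 36 pairs: 45 configurations, even for this toy example.
    print(f"{len(configurations)} fault configurations to test")
```

Even three call sites and three fault types produce 45 configurations; manually specifying them does not scale, which is why generating and executing them automatically is so valuable.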
Chaos engineering as a discipline originated from concerns about infrastructure reliability and the technical and organizational responses to the inherent lack of reliability. But the reliability of microservice applications extends far beyond the infrastructure level to software defects in the application itself and to ensuring resilience against those defects. While chaos engineering is an incredibly useful technique for identifying infrastructure-level defects and testing the organizational response, it is just as important to identify the potential application-wide effects of such defects early in the development process, so that services can be built in a reliable and resilient way.
Filibuster: Automating resilience testing of microservice applications
Filibuster is a tool designed to automatically test microservice applications for resilience. Filibuster operates from the perspective that low-level resilience issues, such as service failures or timeouts, manifest in the application layer as errors or exceptions at the point where interservice RPCs are issued, right alongside application-level failures. To identify a microservice application’s resilience issues, it is therefore only necessary to enumerate these errors, synthesize them in the application, and then verify the application’s behavior under fault through systematic exploration.
To do this, Filibuster relies on a number of techniques that work together in concert:
- First, a static analysis identifies the possible errors visible – either thrown or returned – at each RPC call site within the application.
- Next, Filibuster systematically injects these errors, first one by one and then in combination, while repeatedly executing a functional end-to-end test of the application until the space of possible failures is exhausted, verifying at each step that the application continues to behave desirably. This approach avoids the overhead of requiring developers to manually write unit or integration tests containing mocks for every failure and combination.
- Because developer-written functional tests typically do not contain a test oracle that accounts for the failure of one or more service dependencies, the first injected faults usually cause an assertion in the functional test to fail. This forces developers to think about what the system should do when a dependent service inevitably fails.
- Once developers know what the desired behavior under fault is, they can encode this behavior directly into the functional test using Filibuster-provided conditional assertions. For example, the application returns 200 when all services are online; if service X is down, the application returns error code Y. A sketch of what such a test might look like follows this list.
- You can think of this iterative process as resilience coaching. Through the use of Filibuster, developers are made aware of failure scenarios they did not anticipate and are forced to think about what the application should do before writing an assertion to capture that behavior.
We like to think of this process as resilience-driven development.
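Below is a minimal sketch of what such a conditionally asserted functional test might look like, written in Python. The endpoint, the service names, and the `was_fault_injected_on` helper (and its import path) are illustrative stand-ins for what Filibuster provides; the exact API may differ.

```python
import requests

# Illustrative stand-in for Filibuster's assertion helper, which reports whether
# a fault was injected on a given service during the current test execution.
from filibuster.assertions import was_fault_injected_on  # assumed import path

def test_order_status():
    # Hypothetical endpoint exercised by the functional end-to-end test.
    response = requests.get("http://api-gateway/order/123/status", timeout=5)

    if not was_fault_injected_on("recommendations"):
        # Fault-free execution: the full response is returned.
        assert response.status_code == 200
        assert "recommendations" in response.json()
    else:
        # The recommendations service failed: the order status should still
        # load, just without the optional recommendations block.
        assert response.status_code == 200
        assert "recommendations" not in response.json()
```

Filibuster then runs this single test once per generated fault configuration, and the conditional assertion captures the desired behavior in both the fault-free and the faulty executions.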
In short, Filibuster can identify many of the technical resilience issues that exist in applications earlier, at development time, without the production testing that more traditional chaos engineering techniques require.
Two key Filibuster features
Because no single design suits all organizations, Filibuster was designed with configuration in mind. Without covering all of the possible Filibuster configurations, we highlight what we believe are two key features for organizations considering resilience testing: deterministic replay, a technique that allows developers to write regression tests for outages or previous system failures, and runtime fault selection, a way to use Filibuster incrementally to increase resilience test coverage as code moves through a continuous integration pipeline.
- Deterministic replay: When a test fails under fault injection, Filibuster writes a counterexample file to the file system. Developers can use this file to deterministically replay the failure while debugging, for example by walking through a particular service failure step-by-step with Java’s interactive debugger. The same counterexample file can also be used to write unit, integration, or functional tests that put the system back into the failed state, building a regression suite that simulates past failures and outages.
- Runtime fault selection: Filibuster can be fully configured so that certain faults are tested as part of a local development environment while other faults are tested as part of continuous integration. For example, to keep the local development environment fast, only common faults are tested for each service, while the full list of failures is exercised on each commit to a branch or as part of a nightly release build. All of this can be specified in a configuration file supplied to Filibuster to enable test environment flexibility.
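As an illustration of runtime fault selection, here is a sketch of what such environment-specific fault profiles might look like, expressed as a small Python module for consistency with the other examples; the structure and field names are hypothetical and do not reflect Filibuster’s actual configuration schema.

```python
# Hypothetical fault-selection profiles; not Filibuster's actual schema.
FAULT_PROFILES = {
    # Keep the local development loop fast: only the most common faults.
    "local": {
        "faults": ["timeout", "connection_error"],
        "max_simultaneous_faults": 1,
    },
    # Per-commit CI: add application-level error responses.
    "ci": {
        "faults": ["timeout", "connection_error", "http_500", "http_503"],
        "max_simultaneous_faults": 1,
    },
    # Nightly release build: the full fault list, including combinations.
    "nightly": {
        "faults": ["timeout", "connection_error", "http_500", "http_503",
                   "grpc_unavailable", "grpc_deadline_exceeded"],
        "max_simultaneous_faults": 2,
    },
}

def faults_for_environment(environment: str) -> dict:
    """Return the fault profile to test in the given environment."""
    return FAULT_PROFILES[environment]
```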
Adapting Filibuster to DoorDash
At the start of the internship, Filibuster was designed to test only microservices implemented in Python, using Flask, that communicated strictly over HTTP for RPCs. DoorDash uses Kotlin on the JVM for service implementation and both HTTP and gRPC for interservice RPCs.
To adapt Filibuster for use at DoorDash, the following had to be done:
- First, Filibuster had to be extended to support Kotlin and gRPC. While the extension to gRPC was straightforward, the extension to Kotlin and the JVM proved more difficult because of the concurrency primitives that both provide. In fact, this resulted in several modifications to Filibuster’s core algorithms to support RPCs issued from concurrent code, which was much less of a concern in Python given its limited support for true parallelism.
- Second, Filibuster relied on application-level instrumentation of services to support fault injection. But application-level code modification wasn’t a viable option for DoorDash, both because of the effort required and because of the overhead of maintaining a change set solely for resilience testing. Therefore, I worked with DoorDash engineers to devise a strategy that dynamically adds this instrumentation at runtime, without code modification, to support Filibuster’s required fault injection.
To see how Filibuster can be used on a Java application, review this YouTube video, which demonstrates a concrete result of the internship.
DoorDash has only started its resilience journey with Filibuster. At the end of the summer, a prototype of Filibuster was working on two of DoorDash’s services running in the local development environment, demonstrating the approach’s viability for real-world microservice code. DoorDash plans to extend Filibuster’s reach in the coming months to continue improving the resilience of its application.
To learn more about Filibuster, watch this video overview of the service-level fault injection technique, or read our paper.
To continue learning more about fault injection research, follow Christopher Meiklejohn on Twitter or through his blog.
Do you find this work interesting and exciting? We’re hiring in the area of resilience engineering at DoorDash! Apply here.