On May 12th DoorDash had a system-wide outage that lasted three and a half hours. During this incident, many Dashers were unable to accept new deliveries, consumers had trouble ordering from our mobile app, merchants prepared some meals that went undelivered, and many customers experienced delays in the delivery of their food. This was the result of a cascading failure within DoorDash’s backend services and infrastructure.
Since mitigating the incident, DoorDash has refunded customers and paid merchants for orders that were canceled during this period, and compensated Dashers for orders they were unable to complete. Dasher ratings that were less than five stars during the outage have been removed from their account history.
As an engineering team, we hold ourselves to a high standard and are always working to ensure that we are offering the highest quality of service to the communities we serve. We would like to sincerely apologize to all of our affected Customers, Dashers, and Merchants for any inconvenience caused by this incident.
DoorDash operates a complex three-sided platform and we have service interruptions from time to time that we are typically able to mitigate quickly. The duration and impact of this incident are noteworthy, and we want to share how this incident occurred, what led to its sustained duration, and how we will improve going forward.
Reviewing the incident timeline
At 9:40 am, our storage team began a routine operation to reduce the capacity of our delivery service database. The delivery service database is a critical dependency of our order and delivery flows. Several months earlier, we had completed the migration of delivery data from a legacy database to a fully distributed database and now wanted to reduce the cluster capacity to be more efficient. When downsizing a database cluster, data is redistributed between nodes, and query latency is expected to increase marginally, but not enough to cause an impact to our service. Downsizing in small increments is an operation that we had performed numerous times before without issues, so we didn’t anticipate any problems.
At 10:29 am, our logistics team, along with teams owning some of the delivery service's calling services, received alerts for elevated p99 latency. The delivery service's latency SLO wasn't impacted by the increased latency at that time, but at 10:38 am the storage team dialed down the data replication rate of the downsizing operation to reduce the latency, which alleviated concerns about continuing the operation. Related latency alerts fired again at 11:30 am and 2:45 pm, but because we knew the database operation was ongoing and the SLO was still not impacted, no action was taken.
At 3:57 pm, our logistics team and their dependent teams received another batch of latency alerts, and at 4:04 pm they were paged to investigate errors. The teams assembled on an incident call at 4:07 pm. Dashers were experiencing significant errors when attempting essential functions such as accepting orders and confirming drop-offs. This impact to Dasher flows was correlated with the delivery service database latency and the resulting service errors we had been alerted on. At 4:16 pm, the storage team paused the cluster downsizing operation in an attempt to mitigate the issue.
When the team paused the downsizing operation, the incident unexpectedly got much worse. At 4:16 pm, we were alerted by our synthetic monitoring system of failures on www.doordash.com. This alert was a surprise, as the website does not have an obvious dependency on our delivery service database. We then began receiving reports of widespread impact to the consumer experience. Engineers began reporting various issues they were seeing across many different flows and services, including degradation in our Drive product and our ability to fulfill orders. Over the next 45 minutes, we continued investigating the errors and disruptions, but with so many widespread issues happening at the same time, we could not identify a clear signal pointing to the root problem. Without this signal, we attempted to mitigate by restarting various logistics services, but this did not help us identify a root cause or resolve the incident.
Unknown to the members of the team handling the incident, at 4:12 pm our Traffic team had been alerted that we were hitting our circuit breaker limits in Envoy. The team had been running the Envoy traffic infrastructure for over a year, and this was the first time they had received this alert. Given our lack of experience with this error, the team did not fully understand the magnitude of hitting these limits. They investigated and took action by increasing several Envoy configuration limits, but this alone was not enough to mitigate the system-wide failure.
At 5:30 pm, we decided to turn off customer traffic at our edge and prevent consumers from using the app to place orders, believing this would allow our systems to recover and clear request queues. We then gradually ramped traffic back up over the course of the next 30 minutes. From 6:00 pm to 6:22 pm there was a partial recovery, with 80% of our normal volume of consumer app orders being processed.
From then on, we continued seeing an impact to our Dasher flows as well as intermittent, wider instability of our services. At that point, we understood that Envoy circuit breakers had opened and were likely causing the impact beyond the Dasher flows, so we narrowed our focus back to removing the delivery service database latency that we had originally detected. We performed various mitigation actions related to the delivery service and its database infrastructure, including scaling out the number of service instances and database proxy instances to accommodate the increased latency, and restarting some database instances to undo configuration changes made during the incident. This, together with some of the previous mitigation efforts on Envoy, eventually produced clear signs of recovery.
At 7:30 pm, our services were finally healthy and stable.
Root cause analysis
Since the incident, engineering teams have spent many hours fully investigating the root cause in great detail. For the purpose of this post the root cause can be summarized as follows:
- Our planned database maintenance increased query latency more than we had expected based on prior similar maintenance operations.
- Increased database latency caused increased response latency from the delivery service.
- Increased delivery service latency had a wide-ranging impact on its dependent services, including timeouts resulting in error responses.
- The increased request latency, along with increased traffic due to retries, caused an increase in connection utilization at multiple points within our Envoy infrastructure. We reached limits for both active connections and requests, which caused a large proportion of requests passing through Envoy to be rejected (Envoy returned 503 to callers). Because multiple services shared the same Envoy cluster, this broadly impacted customer flows (see the sketch after this list).
- In an attempt to mitigate the increased delivery database latency, we modified database timeouts and restarted database instances, which caused delivery service errors and tail latency to worsen. This mitigation attempt, together with a surge in customer traffic, caused Envoy to reach its limits again and further impacted the customer experience.
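To make the amplification mechanism above more concrete, here is a minimal back-of-the-envelope sketch in Python. It is purely illustrative: the request rates, latencies, retry counts, and limit are made-up numbers rather than our actual traffic or Envoy configuration, but it shows how slower responses plus naive retries multiply in-flight requests until a shared limit starts rejecting traffic for every service behind it.

```python
# Illustrative only: a toy model of how slow responses plus naive retries
# can exhaust a shared proxy-level request limit. All numbers are made up
# and do not reflect DoorDash's actual Envoy configuration or traffic.

SHARED_MAX_IN_FLIGHT = 1_000   # hypothetical shared request limit
RETRIES_PER_TIMEOUT = 2        # hypothetical caller retry policy

def in_flight(requests_per_sec: float, latency_sec: float, retries: int) -> float:
    """Little's Law: concurrent requests ~= arrival rate * time in system.
    Retries multiply the effective arrival rate."""
    effective_rate = requests_per_sec * (1 + retries)
    return effective_rate * latency_sec

# Healthy: 2,000 rps at 50 ms stays far under the shared limit.
print(in_flight(2_000, 0.050, retries=0))                   # 100.0 concurrent

# Degraded: latency grows to 1 s and callers retry on timeout; in-flight
# requests blow past the shared limit, so the proxy starts rejecting
# requests (HTTP 503) for *every* service behind it.
degraded = in_flight(2_000, 1.0, retries=RETRIES_PER_TIMEOUT)
print(degraded, degraded > SHARED_MAX_IN_FLIGHT)            # 6000.0 True
```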
Although this failure started with increased latency from a database operation, hitting the Envoy limits turned a relatively localized issue into a widespread failure of our systems. That widespread nature added a lot of noise to our signals and made the incident response more chaotic, which dramatically extended the incident.
Learnings and improvements
Here are some of the key issues and follow-ups.
Database infrastructure
Since the incident, we have audited our configuration and usage of this database cluster and now understand that a combination of factors - schema design, suboptimal configuration, and usage patterns - caused the additional latency during the resize process. Until those are corrected on this cluster and others, we won't be performing similar operations. When we resume these operations, we will ensure that we have a better understanding of the latency that our applications can tolerate.
When we migrated delivery data from a legacy database to a new distributed database, we introduced a database proxy to perform the migration and allow for a rollback path. The presence of this proxy added another layer to debug and increased mitigation complexity, so we have expedited its safe removal.
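For readers unfamiliar with this pattern, the sketch below shows roughly what such a migration proxy looks like. It is a hypothetical Python illustration, with assumed names and dual-write behavior rather than our actual implementation, of why the extra layer is useful during a migration but also adds one more place where latency and failures can originate.

```python
# Hypothetical sketch of a migration proxy: it routes delivery reads and
# writes to either the legacy or the new distributed database behind a
# rollback flag. Names and structure are illustrative only.
from dataclasses import dataclass
from typing import Protocol


class DeliveryStore(Protocol):
    def get_delivery(self, delivery_id: str) -> dict: ...
    def save_delivery(self, delivery: dict) -> None: ...


@dataclass
class MigrationProxy:
    legacy_db: DeliveryStore
    new_db: DeliveryStore
    use_new_db: bool = True  # flipping this back is the rollback path

    def get_delivery(self, delivery_id: str) -> dict:
        primary = self.new_db if self.use_new_db else self.legacy_db
        return primary.get_delivery(delivery_id)

    def save_delivery(self, delivery: dict) -> None:
        # Dual-writing keeps the fallback store consistent so rollback stays
        # safe, but every extra hop and timeout now has two possible sources,
        # which is exactly the debugging overhead described above.
        self.new_db.save_delivery(delivery)
        self.legacy_db.save_delivery(delivery)
```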
Traffic infrastructure
We had been running Envoy in production for about 12 months. We left the circuit-breaking configuration at the defaults without sufficient consideration, and we lacked a clear understanding of the impact that hitting these limits would have on our customers. Though we were alerted quickly and had detailed dashboards, we also identified some areas of improvement in our Envoy observability. We are following up with an audit of all critical Envoy configuration settings, improving our understanding and enhancing our visibility with more metrics and alerts, and enabling distributed tracing. A key learning for us is that circuit breakers in Envoy are essentially rate limits - caps on concurrent connections and requests - rather than traditional circuit breakers that trip based on error rates.
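To illustrate that distinction, here is a simplified Python sketch contrasting the two models. It is not Envoy's implementation and the thresholds are hypothetical; the point is that the first model rejects based purely on how much work is in flight, while the second rejects based on how many calls are failing.

```python
# Illustrative contrast between the two breaker models, not Envoy's code.

class ConcurrencyLimitBreaker:
    """Envoy-style 'circuit breaker': effectively a cap on concurrent
    requests. It rejects as soon as the limit is reached, regardless of
    whether upstream calls are failing."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def allow(self) -> bool:
        return self.in_flight < self.max_in_flight


class ErrorRateBreaker:
    """Traditional circuit breaker: trips only after the observed error
    rate crosses a threshold over a recent window of calls."""

    def __init__(self, error_threshold: float, window: int) -> None:
        self.error_threshold = error_threshold
        self.window = window
        self.results: list[bool] = []  # True = success

    def record(self, success: bool) -> None:
        self.results = (self.results + [success])[-self.window:]

    def allow(self) -> bool:
        if len(self.results) < self.window:
            return True
        error_rate = self.results.count(False) / self.window
        return error_rate < self.error_threshold
```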
Multiple services share the same east-west traffic infrastructure. In this configuration, the circuit breaker is shared by all traffic passing through it, which means a single service under heavy connection or request load can cause the circuit breaker to open for other services as well. As part of the immediate incident follow-up, we have effectively disabled shared circuit breaking. We will next look at per-domain traffic routers for better failure isolation.
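The following toy example, with hypothetical service names, limits, and load numbers, shows why we are moving away from a shared budget: under a shared limit, one degraded service can consume the entire budget and starve healthy callers, while per-service limits contain the blast radius.

```python
# Toy illustration only: service names, limits, and load are hypothetical.

SHARED_LIMIT = 1_000
PER_SERVICE_LIMITS = {"delivery": 400, "consumer": 300, "drive": 300}


def shared_allows(in_flight: dict[str, int], service: str) -> bool:
    # The budget is global, so the requesting service doesn't matter:
    # one degraded service can exhaust the limit for everyone.
    return sum(in_flight.values()) < SHARED_LIMIT


def isolated_allows(in_flight: dict[str, int], service: str) -> bool:
    # Each service competes only with itself for its own budget.
    return in_flight[service] < PER_SERVICE_LIMITS[service]


load = {"delivery": 990, "consumer": 20, "drive": 15}  # delivery is degraded

print(shared_allows(load, "consumer"))    # False: a healthy service is rejected
print(isolated_allows(load, "consumer"))  # True: the failure stays contained
```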
Conclusion
We want to again sincerely apologize to our community of Customers, Dashers, and Merchants who count on us to access opportunity and convenience. As an engineering team, we would also like to give credit to and thank the teams across DoorDash who worked to handle support cases, issue refunds, and help to make things right for our customers.
We will learn from this incident and apply those learnings to improve our infrastructure and reliability in 2022 and beyond.