Kubernetes probes are widely adopted but rarely fully understood, and that gap in understanding can cause unintentional outages. Across the industry there is broad familiarity with readiness, liveness, and startup probes for containers, and most organizations use them extensively, yet deep technical understanding of how they behave is far less common.
The Kubernetes community did a particularly good job of marketing these probes; they appear in almost every open-source application running on K8S and in virtually every template or sample found on the internet. As a result, the probes have come to be viewed as generic “health checks” and have been widely adopted without most users having a good understanding of what they actually do under the hood.
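For context, this is roughly how the three probes appear in the manifests that get copied around from those templates and samples. The container name, image, endpoint path, and timing values below are illustrative rather than recommendations:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app               # illustrative
spec:
  containers:
    - name: app
      image: example-app:latest   # illustrative
      ports:
        - containerPort: 8080
      # Startup probe: holds off the other probes until the app has booted.
      startupProbe:
        httpGet:
          path: /health
          port: 8080
        failureThreshold: 30
        periodSeconds: 5
      # Readiness probe: while it fails, the Pod is removed from Service endpoints.
      readinessProbe:
        httpGet:
          path: /health
          port: 8080
        periodSeconds: 10
        timeoutSeconds: 1
      # Liveness probe: when it fails repeatedly, the kubelet restarts the container.
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        periodSeconds: 10
        timeoutSeconds: 1
```

The key point, which we come back to later, is that the three probes look almost identical in the manifest but have very different consequences when they fail: a failing readiness probe removes the Pod from Service endpoints, while a failing liveness probe restarts the container.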
DoorDash is no exception. Having adopted K8S broadly over the last year and a half, we incorporated these “health checks” into almost every one of our applications without giving much thought to whether we needed those particular probes or to what the “health check” endpoint used by each application actually does under the hood. After a particularly bad outage on Black Friday, we learned that health checks can cause serious problems if not implemented correctly, which is why we decided to share our knowledge with the community in this post. First, we will talk about the outage in question and how things went wrong, and then we will share the actionable steps we took to prevent this kind of problem in the future.
Our health checks outage on Black Friday
Because the DoorDash team lacked a deep technical understanding of the health checks we were running, we experienced an informative incident on Black Friday, a typically busy holiday for DoorDash.
Closer to the end of the day, our engineers received various alerts that our Tier 0 service was experiencing issues. Specifically, we saw:
- Spiking CPU utilization
- Increased response latencies
- Burning SLOs
- Reports from other services about failing dependencies
Our incident management tooling allowed us to quickly assemble the incident response team with relevant counterparts and start diagnosing triggers to mitigate the impact.
We were able to assemble an approximate diagnosis and timeline:
- A large number of Pods were failing their readiness probes and were removed from the Kubernetes Service
- The remaining Pods were forced to handle the majority of the requests and were quickly overwhelmed, sending CPU utilization skyrocketing
To mitigate the initial impact, we disabled the readiness health checks on the Pods and service functionality was restored.
Understanding what failed
After the service was back to serving traffic normally, we had the opportunity to look into what exactly happened and what action items needed to be completed to avoid the problem in the future. Typically when dealing with an outage of this nature, it’s important to look at metrics, traces, and logs. In this section, we will outline our examination of each of these.
Looking at metrics to narrow down the search
Generally, the first place we look during an outage is our metrics. Metrics tend to show what is failing and to what extent, e.g., a particular endpoint is returning 503 error codes 90% of the time. For this particular outage, our metrics indicated only an overall increase in latency on all endpoints and failing Kubernetes readiness checks, which didn’t narrow things down to a particular failure. Given that the metrics were not providing much insight, the next step was to take a look at traces.
Using traces to track down individual application requests
After narrowing down the failure to a particular endpoint or determining that the metrics were not helpful, the next step is looking at the traces. Traces provide in-depth information on what functions were executed during a single request. In comparison, checking logs is often more difficult because they can be poorly formatted, and it can be very challenging to find issues without knowing exactly what to look for.
To analyze our Black Friday incident further, we looked at our tracing data. What we found was that health check endpoints were excluded from the reported tracing data. This meant that the tracing data was not going to help us find what caused the health checks to fail, and we would need to check the logs as the next logical step.
How we used logs to find what happened
Since we were not able to find the cause of the health check failures with metrics or tracing data, we needed to turn our attention to logs. As with the traces, health check endpoints were excluded from our application logs, making those less useful in this case. However, we also had logs from our eBPF agent, a piece of software that runs adjacent to all our services and collects data on all TCP, UDP, and DNS requests performed by the service.
From these logs, we found a latency increase toward one of our Redis servers, along with a drop in request volume toward it once we disabled the readiness checks. Importantly, that Redis server sat on a legacy path slated for removal and should not have impacted our application. It turned out, however, that the endpoint used in our readiness check was the default health check endpoint provided by the Spring Boot framework. That default endpoint aggregates a number of smaller, configurable health checks that are enabled by default, one of them being a Redis connectivity check.
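Put together, the problematic pattern looked roughly like the sketch below. The path, port, and timings are representative rather than our exact production manifest; Spring Boot’s actuator exposes its aggregated health endpoint at /actuator/health by default:

```yaml
# Illustrative sketch of the problematic pattern, not our exact manifest.
# The readiness probe points at Spring Boot's aggregated health endpoint,
# which by default also checks attached infrastructure such as Redis, so a
# slow Redis response can blow the short probe timeout even though the
# request-serving code never touches that Redis path.
readinessProbe:
  httpGet:
    path: /actuator/health   # aggregated endpoint: includes the Redis check by default
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 1          # the one-second timeout referenced in the next section
  failureThreshold: 3
```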
Testing our outage theory
After forming a hypothesis about the origin of the failure, we needed to confirm the theory and later verify that the fix addressed it. To test our theory, we used Chaos Engineering, which injects failure into various aspects of a system in order to proactively find weaknesses that can negatively impact it. In this case, we used Litmus Chaos, an open-source Chaos Engineering platform that enables the injection of various failures through targeted experiments. We configured an experiment called Pod Network Latency, which added a one-second latency to all calls to the Redis server previously identified as the source of the health check failures. One second was chosen because our readiness check timeout was also set to that value. With the experiment enabled, we saw readiness checks start to fail in the same manner as during the outage.
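For reference, a ChaosEngine resource for this kind of experiment looks roughly like the sketch below. The namespace, labels, service account, and destination host are placeholders, and the exact experiment fields can vary between Litmus versions, so treat it as an outline rather than a drop-in manifest:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: redis-latency-experiment      # illustrative name
  namespace: example-namespace        # placeholder
spec:
  appinfo:
    appns: example-namespace          # namespace of the target application
    applabel: app=example-app         # label selector for the target Pods
    appkind: deployment
  chaosServiceAccount: litmus-admin   # placeholder service account
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NETWORK_LATENCY
              value: "1000"           # inject 1s of latency, matching the probe timeout
            - name: TOTAL_CHAOS_DURATION
              value: "300"            # run the experiment for five minutes
            - name: DESTINATION_HOSTS
              value: "redis.example.internal"   # placeholder for the Redis host
```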
Documenting action items
Once all the sources of failure have been found and confirmed, it’s important to create tickets and document all action items to avoid similar problems in the future. In our case, we first worked on configuring the health endpoint provided by Spring Boot so that it only performed checks on relevant dependencies. Then we documented the findings and proactively reached out to every team/service using Spring Boot to help mitigate similar issues. We also started an initiative to document the behaviors of the various Kubernetes health checks and share this knowledge across the organization.
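As an example of that first action item, Spring Boot’s individual health contributors can be disabled or scoped through actuator properties. A configuration along these lines in application.yml (property names are from the Spring Boot 2.x actuator; confirm them against the version you run) removes Redis from the status the probe evaluates:

```yaml
management:
  health:
    redis:
      enabled: false          # stop including the Redis check in the aggregated health status
  endpoint:
    health:
      probes:
        enabled: true         # expose /actuator/health/liveness and /actuator/health/readiness
      group:
        readiness:
          include: "readinessState"   # readiness reflects the application's own state,
                                      # not every piece of attached infrastructure
```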
Common health check pitfalls
Based on our findings during this incident and further learnings, we wanted to share what we think are common pitfalls with Kubernetes health checks and actions we recommend to avoid them.
- Lack of understanding of what different Kubernetes probes do
- Use of third-party health check endpoints with no knowledge of what actions they actually perform
- Disabled observability around health checks, including logs, traces, and metrics
Lessons learned about Kubernetes probes
Throughout this project, our team identified gaps in our knowledge and procedures, and we determined measures to ensure our health checks are more effective and efficient. Here are our recommended steps to avoid similar issues:
1. Understand the different applications of the various Kubernetes probes. Ensure the entire department is aware of these use cases.
2. Review what any third-party health check endpoint actually does and which options it exposes. Consider disabling checks and features you do not need.
3. Treat health check endpoints as Tier 0 by instrumenting them with various observability methods and ensuring they are not ignored by the observability tooling. If health checks are providing too much data, consider sampling them or reducing the volume of data they share.
4. Avoid having health checks depend on backend dependencies; an outage in a dependency can turn into an outage in your own service when Kubernetes restarts your containers or pulls them out of the Service. One way to keep the probes independent of external infrastructure is sketched below.
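To make the last two points concrete, here is one way the probe configuration can look once the health endpoints are scoped: liveness points at an endpoint that reflects only the application’s own state, while readiness uses a dedicated readiness group. The paths assume the Spring Boot probe endpoints shown earlier; the port and timings are placeholders:

```yaml
# Illustrative probe configuration once the health endpoints are scoped.
livenessProbe:
  httpGet:
    path: /actuator/health/liveness    # internal application state only; no backend calls
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 1
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /actuator/health/readiness   # readiness group scoped to the app's own readiness state
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 1
  failureThreshold: 3
```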