The ability to attach auxiliary metadata to requests within a large microservice architecture enables powerful use cases, such as infrastructure-level sharding, language localization, and testing-in-production. Attaching this context to requests lets services and infrastructure libraries make local decisions, and it makes the context available to every infrastructure component along the directed acyclic graph that requests follow. Although there are large benefits to enabling context in service-to-service requests, propagating this information across all our microservices is a challenge.
To provide context to our requests, DoorDash is pioneering the adoption of the open source OpenTelemetry project to solve observability challenges for its diverse and expanding microservice architecture. OpenTelemetry relies on context propagation to stitch together the telemetry data for a particular request. Given the dearth of open source or third-party solutions for custom context propagation, OpenTelemetry comes closest to an off-the-shelf offering, which is why we pragmatically chose it over the alternatives.
In this article we will go over our experience adopting and tailoring OpenTelemetry to propagate custom context in order to power a number of critical use cases. We will take a deep dive into how custom context improves our services, how we implemented OpenTelemetry-based propagation, and how we rolled out new versions of OpenTelemetry and handled security concerns.
Diving deep into custom context use cases
DoorDash uses custom context to power a number of important use cases. The Remote Procedure Calls (RPCs) that microservices use to delegate work to other services rely on a standard transport protocol like HTTP or HTTP/2, and an encoding format like Protobuf, Thrift, or JSON, to transmit requests and responses over the wire. Each service serves incoming requests using the data provided in the request. Sometimes, however, it is useful, or even required, to include additional data with the incoming request. One example is authentication tokens for the actors involved in a transaction. Authentication typically happens closer to the network edge, and the resulting token can be passed down the service call graph as a protocol header instead of a separate request field.
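As an illustration, a gRPC client in Kotlin can attach such a token as a metadata header rather than a request field. This is a minimal sketch; the x-auth-token header name and the stub wiring are illustrative assumptions, not our actual implementation.

import io.grpc.Metadata
import io.grpc.stub.AbstractStub
import io.grpc.stub.MetadataUtils

// Hypothetical header name, shown only to illustrate carrying a token
// as a protocol header instead of a request field.
val AUTH_TOKEN: Metadata.Key<String> =
    Metadata.Key.of("x-auth-token", Metadata.ASCII_STRING_MARSHALLER)

// Returns a stub that attaches the token header to every outgoing call.
fun <T : AbstractStub<T>> withAuthToken(stub: T, token: String): T {
    val headers = Metadata().apply { put(AUTH_TOKEN, token) }
    return stub.withInterceptors(MetadataUtils.newAttachHeadersInterceptor(headers))
}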
Another use case is testing-in-production, which allows test traffic to flow through the production deployment. We attach a tenant-id context to every request to distinguish test traffic from production traffic, which lets us isolate data and ensure test traffic never mutates production data. The data isolation is abstracted in the infrastructure libraries, which use the context to route traffic to specific infrastructure components like databases and caches. With large-scale microservice deployments, the industry is converging on testing-in-production for reliable testing with lower operational overhead.
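To make the isolation concrete, here is a minimal sketch of the kind of routing an infrastructure library might perform, assuming the tenant-id is readable from OpenTelemetry Baggage (covered later in this article); the test- prefix convention and the database placeholders are illustrative assumptions.

import io.opentelemetry.api.baggage.Baggage

// Placeholder handles standing in for real database clients.
interface Database
object ProdDatabase : Database
object TestDatabase : Database

// Route storage access based on the tenant-id carried in OpenTelemetry
// Baggage; the "test-" prefix convention is an assumption for illustration.
fun resolveDatabase(): Database {
    val tenantId = Baggage.current().getEntryValue("tenant-id")
    return if (tenantId?.startsWith("test-") == true) TestDatabase else ProdDatabase
}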
Many of the use cases that rely on context propagation are critical for running our normal business operations. This puts stringent reliability and correctness requirements on the context propagation infrastructure.
Context propagation with OpenTelemetry
For propagation, the context can be embedded right into the request itself, for example by modifying the request’s Protocol Buffers. However, a more flexible approach is to propagate the context as a protocol header. Using headers to propagate context scales especially well when a diverse set of services is involved and when context needs to be propagated for most of the endpoints the services expose. Another advantage of header-based propagation is that the caller does not need to explicitly add the context to outgoing calls; the propagation can be implicit, which makes adding a new context a far less invasive change.
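Here is a sketch of what implicit propagation can look like in a shared Kotlin library: a gRPC client interceptor, installed once, forwards a context header on every outgoing call so individual callers never touch it. The contextValueProvider hook is an assumption standing in for however the current value is looked up.

import io.grpc.CallOptions
import io.grpc.Channel
import io.grpc.ClientCall
import io.grpc.ClientInterceptor
import io.grpc.ForwardingClientCall
import io.grpc.Metadata
import io.grpc.MethodDescriptor

// Installed once by a shared infrastructure library; copies the current
// context value onto every outgoing call so callers never add it explicitly.
class ContextForwardingInterceptor(
    private val key: Metadata.Key<String>,
    private val contextValueProvider: () -> String?
) : ClientInterceptor {
    override fun <ReqT, RespT> interceptCall(
        method: MethodDescriptor<ReqT, RespT>,
        callOptions: CallOptions,
        next: Channel
    ): ClientCall<ReqT, RespT> =
        object : ForwardingClientCall.SimpleForwardingClientCall<ReqT, RespT>(
            next.newCall(method, callOptions)
        ) {
            override fun start(listener: ClientCall.Listener<RespT>, headers: Metadata) {
                // Inject the current context value, if any, before the call starts.
                contextValueProvider()?.let { headers.put(key, it) }
                super.start(listener, headers)
            }
        }
}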
OpenTelemetry requires propagation of trace headers. This includes the tracing IDs and vendor-specific headers. OpenTelemetry provides auto-instrumentation to help propagate trace headers across thread and service boundaries. Auto-instrumentation covers an increasingly large variety of libraries and frameworks across different languages. This is especially true for Java/Kotlin, which is used by most of the DoorDash backend services.
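As a reference point, enabling this auto-instrumentation on a JVM service is a deployment-time change rather than a code change; the snippet below uses the agent’s documented otel.propagators option, with the jar path as a placeholder.

# Attach the OpenTelemetry Java agent (jar path is a placeholder) and
# enable the W3C Trace Context and Baggage propagators.
java -javaagent:/path/to/opentelemetry-javaagent.jar \
     -Dotel.propagators=tracecontext,baggage \
     -jar service.jar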
Some notable features of OpenTelemetry’s context propagation are that it:
- Is available through auto-instrumentation.
- Supports libraries and frameworks in a variety of languages that we use at DoorDash, including Java/Kotlin, Node, Python, and Go.
- Is a vendor agnostic propagation format, which includes open formats like the W3C's Trace Context and Baggage.
- Supports synchronous flows like HTTP and HTTP/2, and asynchronous flows like Kafka.
OpenTelemetry supports multiple formats for propagation of context including Baggage, a format specifically designed for propagating custom context.
OpenTelemetry propagation formats
OpenTelemetry supports a variety of propagation formats, like Trace Context, Baggage, Zipkin, and B3. At DoorDash we are standardizing on Trace Context for tracing data. For custom context propagation we are standardizing on Baggage.
A close look at OpenTelemetry’s propagation formats
Trace Context defines two headers: traceparent and tracestate.
A traceparent header, shown in Figure 1, helps uniquely identify an incoming request. It contains version, trace-id, parent-id, and trace-flags. This header helps stitch together the spans that a request generates as it flows from one component to another.
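For reference, a traceparent header following the W3C Trace Context format looks like this (the value is the example from the specification):

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

Here 00 is the version, the 32 hex characters are the trace-id, the next 16 are the parent-id, and the trailing 01 is the trace-flags field, indicating the trace was sampled.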
The tracestate header, shown in Figure 2, contains key-value pairs of arbitrary data, delimited by commas, that allow additional identifiers to be propagated alongside the traceparent header.
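A tracestate header carrying two vendor entries, again using values from the W3C specification, looks like:

tracestate: congo=t61rcWkgMzE,rojo=00f067aa0ba902b7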
Tracestate can be used to propagate custom context, but there are a few limitations. The standard recommends limiting the size of the header. Although this is not a hard requirement and the limit can be raised by making it configurable, any such change would have to be rolled out to every service.
Baggage, shown in Figure 3, is designed specifically for propagating custom context and has much higher limits on the size of the data being propagated. It defines a header called baggage, which is very similar to tracestate.
As shown in Figure 4, custom context can be defined as a key-value pair similar to tracestate. Additionally, tags or properties can be defined for a key by appending them after semicolons.
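Putting these rules together, a baggage header carrying two of our context fields might look like the following; the values are hypothetical:

baggage: tenant-id=tid-42;source=mobile,test-workspace=tws-checkout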
We abstract away the storage and retrieval of the custom context in helper libraries for all the common languages in use at DoorDash. Service owners can introduce a new custom context by adding it to a central configuration, shown in Figure 5, which also serves as an allowlist. The configuration is a simple JSON file that lets service owners define certain properties of the context.
{
  "test-workspace": {
    "max_length": 16,
    "allowed_from_client": true,
    "short_name": "tws"
  },
  "tenant-id": {
    "max_length": 16,
    "allowed_from_client": true,
    "short_name": "tid"
  },
  ...
}
Figure 5: This custom context allowlist shows two fields, test-workspace and tenant-id, with three properties each specifying maximum length allowed for the field, a flag to indicate if the field can be propagated from the web/mobile clients, and a short name used for actual propagation.
By introducing a custom context library, shown in Figure 6, we can change the underlying implementation for context propagation. For example, this approach provides the flexibility to use a distributed cache like Redis for larger contexts and propagate only the cache reference using the OpenTelemetry headers.
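To make this concrete, here is a minimal sketch of such a helper in Kotlin, built on the OpenTelemetry Baggage API. The hard-coded allowlist mirrors Figure 5 (short_name and allowed_from_client handling are omitted for brevity); in practice the library would load the central JSON configuration.

import io.opentelemetry.api.baggage.Baggage
import io.opentelemetry.context.Context

object CustomContext {
    // Mirrors the Figure 5 allowlist: key -> max_length. Hard-coded here
    // for illustration; loaded from the central configuration in practice.
    private val allowlist = mapOf(
        "test-workspace" to 16,
        "tenant-id" to 16
    )

    // Read a custom context value from the current Baggage, if present.
    fun get(key: String): String? = Baggage.current().getEntryValue(key)

    // Validate against the allowlist and return a Context carrying the new
    // entry; callers make it current for the duration of their work.
    fun set(key: String, value: String): Context {
        val maxLength = requireNotNull(allowlist[key]) { "context key not allowlisted: $key" }
        require(value.length <= maxLength) { "value exceeds max_length for $key" }
        return Baggage.current().toBuilder()
            .put(key, value)
            .build()
            .storeInContext(Context.current())
    }
}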
Eventually we envision having OpenTelemetry-based propagation right from our mobile and web clients. For now, we use raw protocol headers to propagate context from the mobile and web clients. Figure 7 details the flow of headers as the request travels from the web/mobile clients to the backend services. We use automatic instrumentation for onboarding the supported services to OpenTelemetry. OpenTelemetry-based propagation begins at the backend-for-frontend (BFF) services. Additionally, the incoming raw protocol headers are transformed into OpenTelemetry headers, which are then propagated to the backend services using the OpenTelemetry auto-instrumentation feature.
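For illustration, the BFF-side transformation can be as small as copying a raw client header into Baggage; the x-tenant-id header name below is hypothetical and the surrounding server plumbing is omitted.

import io.opentelemetry.api.baggage.Baggage
import io.opentelemetry.context.Context

// Promote a raw client header (e.g., a hypothetical "x-tenant-id") into
// OpenTelemetry Baggage so downstream propagation to backend services
// happens implicitly via auto-instrumentation.
fun promoteClientHeader(rawTenantId: String?): Context {
    if (rawTenantId == null) return Context.current()
    return Baggage.current().toBuilder()
        .put("tenant-id", rawTenantId)
        .build()
        .storeInContext(Context.current())
}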
It is important to note that the sampling policy for the OpenTelemetry traces does not affect the propagation of context. The sampling policies only affect collection and aggregation of the traces.
Rolling out new versions of OpenTelemetry
As one of the early adopters of OpenTelemetry, we had to keep up with the rapid churn of the open source tooling and its frequent releases, including incompatible API changes. We quickly realized that we would have multiple versions of the OpenTelemetry tooling deployed in production at any given time. Fortunately, the open propagation format helps preserve header formats across versions. However, we do have to track library versions that depend on specific OpenTelemetry versions. Bumping the OpenTelemetry version sometimes requires bumping versions of related libraries across services en masse. We have been exploring tools to facilitate auto-updating of library versions, including some homegrown tools.
The rollout of a new OpenTelemetry version is handled with caution given the rapid development within the project. In order to contain the fallout, we have devised a way to selectively roll out a new version to a portion of the fleet and gradually ramp up as we build confidence. That being said, because critical use cases rely on context propagation, it is imperative that context is being propagated regardless of the OpenTelemetry version a service is using.
Addressing security considerations
With OpenTelemetry auto-instrumentation, the headers are propagated implicitly and unconditionally. While this simplifies adoption, it poses the risk of exposing potentially sensitive context to the third-party entities being called. Although auto-instrumentation can be disabled for a library’s propagation, it cannot be disabled selectively based on network targets. The risk also runs the other way: third-party entities calling into DoorDash might carry irrelevant context that we would prefer not to propagate to DoorDash services. To address this, we drop all OpenTelemetry headers other than traceparent at both ingress and egress of the DoorDash network. This prevents unwarranted injection of context from outside the network, as well as exposure of internal context to the outside.
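As a sketch of what this filtering might look like on a gRPC ingress path (the actual enforcement point, such as an edge proxy, and the exact header list are implementation details), assuming the W3C/OpenTelemetry header names:

import io.grpc.Metadata
import io.grpc.ServerCall
import io.grpc.ServerCallHandler
import io.grpc.ServerInterceptor

// Drop every OpenTelemetry-related propagation header except traceparent
// before the request reaches internal services.
class EdgeHeaderFilter : ServerInterceptor {
    private val dropped = listOf("tracestate", "baggage", "b3")

    override fun <ReqT, RespT> interceptCall(
        call: ServerCall<ReqT, RespT>,
        headers: Metadata,
        next: ServerCallHandler<ReqT, RespT>
    ): ServerCall.Listener<ReqT> {
        for (name in dropped) {
            headers.removeAll(Metadata.Key.of(name, Metadata.ASCII_STRING_MARSHALLER))
        }
        return next.startCall(call, headers)
    }
}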
The library abstraction for custom context also allows us to optionally encrypt just the headers when service-to-service traffic is not itself encrypted, providing an additional layer of security against exposure of potentially sensitive data.
Conclusion
Propagating cross-cutting, frequently needed business context is pervasive in a rapidly growing microservice architecture. OpenTelemetry offers a solution that not only enables distributed tracing in a vendor-agnostic manner but also provides easy-to-use open source tooling for a variety of languages and platforms. With certain security and rollout guardrails in place, custom context propagation via OpenTelemetry can help accelerate the use cases that come to rely on it.