When companies move to microservices, they need to address a new challenge of setting up distributed tracing to identify availability or performance issues throughout the platform. While various tools offered on the market or through open-source perform this task, there is often a lack of standardization, making leveraging these tools costly or complicated.
As DoorDash migrated off its monolith to microservices, we considered these issues and chose OpenTelemetry, an open source distributed tracing project, for its emphasis on telemetry data standardization. To make this implementation successful, we needed to ensure that we could utilize OpenTelemetry without incurring any performance costs that would negatively affect our platform and user experience. To do this, we created a baseline of how OpenTelemetry performed on our platform. We then researched six ways of optimizing its performance and compared them side by side to pick a winner. 
What is distributed tracing?
Before we jump into our benchmarking experiments, let’s review some important terminology. Tracing is the process of specialized logging that records information about a program’s execution in real-time. A trace, for example, contains information about the time a program spends in a function. Traces are generally helpful in debugging software performance issues.
Tracing in a monolith architecture is relatively easy. But in a microservice architecture, where a request travels through many services, it is often difficult to get an end-to-end picture of the request’s execution, making it challenging to identify availability or performance issues through traditional tracing.
Distributed tracing, a method to profile and monitor applications built on microservice architecture, solves traditional tracing challenges. As DoorDash moved away from a monolith architecture to a microservice distributed architecture, we needed distributed tracing to identify and resolve availability and performance issues quickly.
Many open source projects or commercial vendors provide distributed tracing. However, for many of the options on the market, there is a lack of standardization, resulting in a lack of data portability and a costly burden on users to maintain the distributed tracing instrumentation libraries. The OpenTelemetry project solves the standardization problems by providing a single, vendor-agnostic solution.
What is OpenTelemetry?
OpenTelemetry is a collection of tools, APIs, and SDKs to instrument, generate, collect, and export metrics, logs, and traces for analysis to understand software's performance and behavior. OpenTelemetry formed as a merger of OpenTracing and OpenCensus after the developers of those projects decided that it’s better to develop a single, vendor-agnostic approach for enabling observability.
OpenTelemetry supports three data sources: traces, metrics, and logs. At Doordash, we decided to adopt OpenTelemetry for traces and began testing its distributed tracing feature for our platform.
How does distributed tracing work using OpenTelemetry?
Once we decided to use OpenTelemetry, we wanted to look into its distributed tracing functionality. Distributed tracing tracks a single request’s progression across the multiple services that form an application. An example of a request might be a call from a mobile endpoint or a UI action by an end-user. A single trace is essentially a tree of spans, where spans are objects that represent a unit of work done by individual services or components involved in the request. A single trace contains a single root span that encapsulates the entire request. The root span would contain child spans which in turn may have child spans.
Using OpenTelemetry, services generate a trace with a globally unique identifier. OpenTelemetry propagates the trace context to all underlying services where each service generates the spans. OpenTelemetry sends the spans to a local collector, which enhances the trace data with custom metadata. The local collectors send the trace data to a collector gateway, where they are processed and shipped to distributed tracing tools such as Jaeger, NewRelic, Splunk or LightStep for storage and visualization.
The trace identifier generated at the root span remains the same for all the child spans within the request. Distributed trace visualizers use this trace identifier to stitch together the spans generated for the same request, thus providing an end-to-end picture of the request execution across the application. Figure 1, below, outlines the high-level architecture of how we set up OpenTelemetry’s distributed tracing.
Load testing OpenTelemetry to identify performance impact
Any significant increase in request latency or CPU overhead would mean a negative experience for DoorDash users or higher resource usage. Once we decided to adopt OpenTelemetry, we wanted to understand the performance impact of using OpenTelemetry’s distributed tracing functionality in our services.
Using our internal load testing tool, we tested the OpenTelemetry agent attached to a Netty gRPC service written in Kotlin. The gRPC service has two rpc calls, createMessage and getMessages, which use AWS’ simple queue service to store and retrieve the messages. OpenTelemetry’s Java agent is attached to the service with the following span exporter configuration options:
- otel.traces.exporter=otlp (OpenTelemetry Protocol)
- otel.traces.sampler.arg=1 (100% sampled)
- otel.bsp.max.queue.size=2048
- otel.bsp.max.export.batch.size=512
- otel.bsp.schedule.delay=5000 (5 seconds)
We hit the two rpc calls using a synthetic load generating ~500 peak rps. The service deployment has a 1000 millicores CPU and 1000 mebibytes memory allotted in a Kubernetes cluster. We observed 72% pod CPU utilization at peak but only observed 56% pod CPU utilization at peak without the exporter. These experiments show that we have to spend more computing resources to adopt OpenTelemetry in our platform.
At this point, we noticed an increase in CPU usage with OpenTelemetry when the span exporter is enabled. CPU profiling is generally helpful in identifying these kinds of problems. Our internal load testing tool has an option to enable the YourKit Java profiler and automatically collect CPU profiling data from the service under test. We leveraged this profiling option to collect the CPU profiling data to identify the performance issue. The CPU profiling data showed a background thread named BatchSpanProcessor_WorkerThread that is continuously polling spans from a blocking queue.
Note that even though the profiler shows that the batch span processor’s thread is in a waiting state, there is a direct and indirect CPU cost due to the kernel’s context switches. We played with the batch span processor’s (BSP) options as described below to see if they would help reduce the CPU overhead, but they didn’t help.
- Increased batch size otel.bsp.max.export.batch.size to 5120
- Reduced trace sampling otel.traces.sampler.arg to 0.01
- Increased schedule delay otel.bsp.schedule.delay to 300000 (5 minutes)
Once the workarounds discussed above didn’t help resolve the performance problem, we wanted to do a deep dive into the implementation of OpenTelemetry’s BSP.
How does OpenTelemetry’s BSP work?
Once we identified that the BSP is causing the CPU overhead, we inspected OpenTelemetry’s codebase and its configuration options to understand BSP. Figure 5, below, shows the implementation of BSP, the component in OpenTelemetry that processes the spans generated by microservices and exports them to the collector.
The OpenTelemetry specifications outline BSP behavior as follows:
otel.bsp.max.queue.size
controls the maximum number of spans in the waiting queue. Any new spans are simply dropped once the queue is full.- The exporter thread waits until the current number of spans it collected from the queue reaches
otel.bsp.max.export.batch.size
before sending them to the collector in one batch. Batching, as explained above, makes perfect sense to reduce the overhead of any network or I/O calls. - Once every
otel.bsp.schedule.delay
interval, the exporter thread will send any spans it collected from the queue.
Creating benchmarks to establish a performance baseline for the BSP
After going through the BSP’s implementation, we noticed two apparent issues. The first one is the lock contention with the ArrayBlockingQueue. ArrayBlockingQueue uses an exclusive lock to provide threads safe access to the underlying queue. Only one thread holds access to the queue at a time while other threads wait for their turn. This access pattern results in threads doing a lot of waiting, leading to low throughput.
The second issue is the exporter thread’s continuous polling. The exporter thread receives a signal for each new queue entry due to the continuous polling, which causes context switch overhead resulting in high CPU usage. We wrote two benchmarks using Java Microbenchmark Harness (JMH) that surfaced the above two performance issues, then we established a performance baseline and compared the new BSP’s implementations.
Use Java Microbenchmark Harness to write benchmarks
Writing benchmarks using a naive framework is difficult and gives misleading results as many JVM and hardware-level optimizations are applied to the component under test, making the code appear to perform better than in reality. These optimizations are not applicable when the component under test runs as part of a larger application. JMH provides a convenient way to warm up JVM, run multiple iterations, and create multiple worker threads.
The example shown below runs a benchmark measuring throughput (operations per second) with five threads, warms up the JVM for a second, and runs five iterations where each iteration runs for five seconds. Each thread continuously processes a new span for the entire duration of the benchmark.
@Benchmark
@Fork(1)
@Threads(5)
@Warmup(iterations = 1, time = 1)
@Measurement(iterations = 5, time = 5)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
public void export_05Thread(
BenchmarkState benchmarkState, @SuppressWarnings("unused") ThreadState threadState) {
benchmarkState.numThreads = 5;
benchmarkState.processor.onEnd(
(ReadableSpan) benchmarkState.tracer.spanBuilder("span").startSpan());
}
Establishing the right metrics for performance comparison
Once we decided to use JMH to write benchmarks, we wanted to establish the right performance metrics to compare the BSP’s different implementations.
The number of span exports per second
Although the primary JMH metric, operations per second, is a good metric in general, the number of spans getting exported is much better for understanding the BSP’s performance and effectiveness. For example, an implementation of the BSP that aggressively drops the spans would have very high throughput, but it does little valuable work because traces have missing spans. JMH provides a convenient way to track secondary metrics such as exports per second. We will be using the metric exports per second going forward to compare different versions of the BSP. Since we are interested in testing BSP in an isolated environment, we faked the network calls that send spans to the collector to do nothing. This kind of faking was necessary to prevent any external noise in the benchmark experiments.
Measuring CPU time of the exporter thread
Measuring CPU usage of the exporter thread is also very important since we are interested in making the exporter thread very lightweight. Measuring exporter thread CPU is very tricky with JMH since it consumes as much CPU as possible to hit peak throughput. We employed a simple hack shown in the example below to run the benchmark in a steady state. Adding a significant sleep in the JMH worker threads, we generated consistent requests per second.
private static void doWork(BenchmarkState benchmarkState) {
benchmarkState.processor.onEnd(
(ReadableSpan) benchmarkState.tracer.spanBuilder("span").startSpan());
// This sleep is essential to maintain a steady state of the benchmark run
// by generating 10k spans per second per thread. Without this, JMH outer
// loop consumes as much CPU as possible making comparing different
// processor versions difficult.
// Note that time spent outside of the sleep is negligible allowing this sleep
// to control span generation rate. Here we get close to 1 / 100_000 = 10K spans
// generated per second.
LockSupport.parkNanos(100_000);
}
While running the benchmark, we used the YourKit Java profiler to measure the CPU time of the exporter thread.
Configuration used for the benchmark
All the benchmark experiments are using the configuration given below.
otel.bsp.max.queue.size = 2048
otel.bsp.max.export.batch.size = 512
otel.bsp.schedule.delay = 5000ms
Benchmark results with different implementations of the BSP
Once we established the metrics, it was easy to compare and contrast the BSP’s different implementations. The goal is to increase the exports per second while reducing the CPU usage of the exporter thread. We will be using throughput to signify the exports per second in the benchmark results. Note that all the benchmark results discussed below are from runs using a 2.4 GHz 8-core Intel processor. When trying to figure out the best batch span processor, we looked at the following options:
- ArrayBlockingQueue
- ConcurrentLinkedQueue
- Disruptor with TimeoutBlockingWait strategy
- Disruptor with SleepingWait strategy
- ArrayBlockingQueue with signal batching
- ConcurrentLinkedQueue with signal batching
- MpscQueue with signal batching
For each of these options, we ran the benchmarks and compared the results. This approach gave us a side-by-side comparison of exports per second and the CPU time, which enabled us to pick a winner and use it in OpenTelemetry’s BSP.
BSP using ArrayBlockingQueue
The implementation using Java’s inbuilt ArrayBlockingQueue is the existing implementation and was used as a baseline. It uses a preallocated array buffer to implement the queue and a lock for concurrent access. Figure 6 and Figure 7, below, show the benchmark results while using this approach.
Our benchmark results show that the ArrayBlockingQueue approach does not scale since the throughput and CPU time get steadily worse as the number of threads increases mainly due to the lock contention.
There is additional overhead hurting the throughput besides the lock contention with the increasing number of threads. A rising number of threads usually puts pressure on the memory subsystem, resulting in decreased performance. Modern processors rely heavily on cache memory to keep a thread’s state. Cache memory is usually one or two orders of magnitude faster than main memory. When we have too many threads, the cache becomes full, and the processor has to evict the cache entries frequently. The evictions result in frequent cache misses leading to main memory access, resulting in decreased performance.
BSP using ConcurrentLinkedQueue
The next obvious choice is to use Java’s inbuilt ConcurrentLinkedQueue. ConcurrentLinkedQueue implements a non-blocking algorithm using atomic compare-and-swap (CAS) supported by modern processors. Given the nature of the algorithm, it is only applicable to an unbounded queue. An unbounded queue is not a great option since we need to allocate memory for every push operation with the queue and deallocate once the entries pop from the queue head. Figure 8 and Figure 9, below, show the benchmark results we got while using this approach.
Using ConcurrentLinkedQueue, we see an increase in throughput and a decent decrease in the CPU time of the exporter thread. However, the throughput flattens out much earlier than expected with the increase in the number of threads. We identified that there are subtle issues with concurrency on a queue data structure.
- We noticed that there is contention on head or tail even with CAS. The head and tail usually occupy the same cache lines, making cache coherence very expensive in modern processors due to frequent cache misses. This performance degradation phenomenon is commonly known as false sharing.
- There is unnecessary garbage collector activity due to memory allocations for new nodes.
- size() implementation of ConcurrentLinkedQueue is actually linear in complexity. Since OpenTelemetry specifications require a bounded size queue, we had to use an atomic counter to efficiently track the queue size and find whether the queue is full.
BSP using LMAX Disruptor
LMAX Disruptor, our next choice, is designed to address the concurrency concerns with queues. The key design ideas used by the Disruptor are:
- It uses a specialized ring buffer with preallocated memory and memory padding to avoid false sharing.
- It uses sequencing, a concurrency technique, to reduce lock contention with multiple producers. Producers claim the next slot in the ring buffer using an atomic counter. The claiming producer updates the ring buffer’s entry and commits the change by updating the ring buffer’s cursor. The cursor represents the latest entry available for consumers to read.
- Consumers read from the ring buffer and have different waiting strategies when waiting for new entries.
Once we picked the Disruptor, we proceeded to the benchmark experiments with the two most common waiting strategies offered by Disruptor, TimeoutBlockingWait and SleepingWait.
Disruptor using TimeoutBlockingWait strategy
We first tried out the Disruptor with the timeout blocking wait strategy. When using this waiting strategy, the Disruptor lets consumers wait for new entries in the ring buffer using a conditional variable. This waiting strategy is similar to how the exporter thread is waiting for new entries using ArrayBlockingQueue or ConcurrentLinkedQueue. Figure 10 and Figure 11, below, show the benchmark results we got while using this approach.
The benchmark results show that the Disruptor with TimeOutBlockingWait strategy is a good choice because of the increase in throughput and decrease in CPU time compared to the baseline. These improvements occur because of the Disruptor’s non-blocking algorithm. The benchmark results also show that this approach did not perform better than the ConcurrentLinkedQueue approach, revealing that the writer threads’ signaling was the bottleneck.
Disruptor using SleepingWait strategy
After using the timeout blocking wait strategy, we wanted to try out the Disruptor with the sleeping wait strategy. When using this wait technique, the Disruptor’s consumer threads that are waiting for new entries initially go through a busy-wait, then use a thread yield, and eventually sleep. According to the Disruptor documentation, this waiting strategy will reduce the producing thread’s impact as it will not need to signal any conditional variables to wake up the exporter thread. Figure 12 and Figure 13, below, show the benchmark results we got while using this approach.
The benchmark results show that the Disruptor with SleepingWait strategy is not a good choice because of the significant increase in CPU time compared to the previous approaches. We noticed that this increase in CPU time was due to the busy-wait of the exporter thread. It is important to note that the benchmark results show that this approach was better for high throughput. This high throughput occurred because this approach did not require signaling of the exporter thread. However, the throughput flattened out with 10 threads or more due to the pressure on the memory subsystem and the CPU cycles wasted by the busy-wait. After analyzing these benchmark results, we moved to explore new ways of reducing the signaling overhead while lowering the exporter thread’s CPU time.
Batching the signals to reduce the exporter thread’s CPU time
Thus far, the benchmarks with different implementations showed us the most significant bottleneck to throughput is the writer threads’ locking and signaling, and the significant CPU cost to the exporter thread is context switching or a busy-wait.
The exporter thread is notified about any new entry right away, even though it will export the spans only when its buffer reaches maximum export size. This kind of exporting behavior allows us to reduce the exporter's thread CPU time by notifying only when it will end up doing the export operation.
We use an atomic integer to let the writer threads know the number of spans the exporter thread needs for the export operation. The hypothesis is that this kind of signal batching will reduce the writer threads’ contention and the exporter thread’s frequent context switches. Below is a pseudocode implementation of the signal batching.
// pseudocode of the writer thread
if queue.offer(span):
if queue.size() >= spansNeeded.get():
// notify the exporter thread
signalExporterThread()
// pseudocode of the exporter thread
while continue:
while queue.isNotEmpty() and buffer.size() < EXPORT_SIZE:
// Fill the local buffer
buffer.add(queue.poll())
if buffer.size() >= EXPORT_SIZE:
// Export the buffer
export(buffer)
if queue.isEmpty():
// Wait till there are enough spans for next export
spansNeeded.set(EXPORT_SIZE - buffer.size())
waitForSignal(WAIT_TIME)
spansNeeded.set(Integer.MAX_VALUE)
BSP using signal batching and ArrayBlockingQueue
This implementation uses the ArrayBlockingQueue but batches the signals. Figure 14 and Figure 15, below, show the benchmark results using this approach.
The benchmark results show that the signal batching with ArrayBlockingQueue is a good choice compared to the baseline because of the improvement in throughput and a significant decrease in the CPU time. But, the throughput suffered with the increase in number of threads due to lock contention and pressure on the memory subsystem. We therefore proceeded to test signal batching with other queue implementations.
BSP using signal batching and ConcurrentLinkedQueue
This implementation uses Java's inbuilt concurrent linked queue to reduce lock contention and signal batching to reduce the exporter thread’s CPU time. Figure 16 and Figure 17, below, show the benchmark results using this approach.
The benchmark results show that the signal batching with ConcurrentLinkedQueue is a good choice compared to the previous approaches because of the improvement in throughput and significant decrease in the CPU time. We then proceeded to test signal batching with other queue implementations to see if we could find an even better solution.
BSP using signal batching and disruptor
The LMAX Disruptor doesn’t offer any batch waiting strategy. Although it is possible to implement a new waiting strategy inside the Disruptor, we found an alternate solution, described in the next section, that is much easier to implement.
BSP using signal batching and MpscQueue
Even though LMAX Disruptor is not applicable with regard to batching signals, the Disruptor’s fundamental design principles, as shown in the above benchmarks, worked very well to mitigate the concurrent linked queue’s contention issues. We learned about different concurrent queue implementations offered by the JCTools library, and the multi-producer single consumer queue (MpscQueue) worked perfectly for the BSP’s use case.
- It is tailor-made for the BSP’s use case where we have multi-writer threads and a single consumer, the exporter thread.
- It uses a ring buffer and sequencing to resolve head or tail contention. The preallocated ring buffer implies there is no unnecessary garbage collector activity.
- It uses memory padding to avoid false sharing.
- MpscQueue doesn’t signal the consumer about the new entries, so it is easy to implement custom signaling with batching.
We noticed that MpscQueue is very similar to the Disruptor, but we can easily apply the signal batching. Figure 18 and Figure 19, below, show the benchmark results using this approach.
The benchmarks show that MpscQueue with the signal batching approach is best for high throughput and a less CPU intensive batch span processor. MpscQueue does resolve the write contention issue seen in ConcurrentLinkedQueue and, with signal batching, reduces the CPU overhead of the exporter thread. The throughput flattened out with ten threads or more due to the pressure on the memory subsystem as seen in the previous benchmarks. We learned that the existing waiting strategies with the Disruptor tend towards lowering the latency and are CPU intensive, whereas MpscQueue with signal batching proved to be a good alternative.
And finally, the load test that initially revealed the problem showed us the optimization reduced CPU utilization.
Trade-Offs with signal batching
Although MpscQueue with signal batching worked best with the default configuration options, it hurts throughput if the export batch size (otel.bsp.max.export.size)
is close to the maximum queue size (otel.bsp.max.queue.size)
or higher. The trade-off for higher throughput is using a bigger queue with a size greater than the export batch size.
Conclusion
For any companies running on a microservice architecture, distributed tracing is an essential tool. OpenTelemetry is an excellent choice for addressing distributed tracing needs, but the performance issues we have seen with OpenTelemetry’s BSP component could negatively impact throughput and increase resource usage costs.
To address the negative impact, we contributed benchmarks and fixes to the OpenTelemetry project. Specifically, we contributed JMH benchmarks and optimizations in the BSP to improve throughput and reduce CPU cost. These optimizations will be available in the next release of the OpenTelemetry java library. Generally speaking, we believe the benchmarks and the different queue implementations we discussed could be applied to build highly performant multi-threaded producer-consumer components.
Acknowledgments
Thanks to Amit Gud and Rabun Kosar for their OpenTelemetry adoption efforts. Many thanks to the OpenTelemetry project maintainers who responded quickly to the batch span processor’s performance issues and worked with us to implement the optimizations.