
Building an in-house case management platform

April 15, 2025

Dan Behar

Consumers, Dashers, and merchants rely on DoorDash support every day to quickly resolve issues like missing items or delayed orders. While we automate many solutions, complex problems require live agent attention. Historically, agents navigated multiple disconnected internal and third-party tools across chat, phone, and web. Despite their strengths, these fragmented systems slowed responses, required manual workarounds, and created inconsistent experiences for both agents and customers.

To tackle these challenges, we embarked on a multi-year journey to build a unified, in-house case management platform. Our goal was clear: Eliminate fragmentation, establish a single source of truth, and create seamless interoperability across systems — all without sacrificing flexibility or innovation.

Instead of merely replicating existing processes, we leveraged each migration step to introduce compelling features that would inspire stakeholders and drive meaningful improvements. This post shares insights from our journey, including key decisions, trade-offs, and lessons learned, offering valuable perspectives for teams navigating similar large-scale migrations.

Our journey

Our effort to build a comprehensive case management platform started more than two years ago, driven by DoorDash's complex operational needs. Instead of a massive, upfront replacement, we began with incremental goals and a clear vision.

Agents previously had to switch frequently between siloed tools, causing inefficiencies. Figure 1 illustrates the multiple interactions across various disconnected tools required for an agent to help a customer and resolve a case successfully. Our solution has unified these fragmented data views, greatly reducing context switching.

Figure 1: The previous agent resolution process was convoluted.

As shown in Figure 2, we iteratively migrated agents, introducing essential features as soon as they were viable. The chart uses a logarithmic scale, clearly highlighting significant early adoption and rapid percentage increase during the initial migration stages. Notably, the graph reflects users fully migrated to our end-state solution rather than the earlier partial migration involving only the UI layer. These early wins helped drive momentum and validated our incremental approach, enabling continued platform expansion to support additional teams over time. 

Figure 2: Early wins eased user migration over time.

Building the foundation: One unified UI

First, we tackled improving the user interface. Previously, agents worked directly within a customer relationship management system, or CRM, relying on an internal tool called Dispatch for delivery data. In place of that process, we created Teammate Headquarters, a centralized, case-focused React UI that incrementally wraps existing functionality while effectively managing state.

To provide data to this UI, we built a lightweight Kotlin backend to cache and consolidate CRM data, enabling faster access and better integration. Extensive user research — including shadowing agents via DoorDash's WeSupport program, visiting a call center, and conducting personalized shadow sessions — helped us to deeply understand workflows and informed our designs. 

Figure 3 illustrates the calls made by the UI. The top line shows all incoming requests, where some of the early scaling is visible. The bottom line shows calls to the CRM, demonstrating that we were effectively caching data and offloading calls.

Figure 3: The top line shows UI fetching during early 10x scaling, while the bottom line reflects calls to CRM.

Early experimentation with a minimum viable product

Previously, proactive delivery issues were managed with a rudimentary internal tool called the Realtime Disaster Tracker that relied on manual tracking via Google Sheets. Assignments were handled by manually computing delivery_uuid % constant, as shown in the “Assigned Agent” column in Figure 4. To replace this inefficient system, we integrated core functionalities directly into our platform, including automated case creation, assignment, and status tracking.

Figure 4: Earlier versions of the Realtime Disaster Tracker used computed agent IDs, with no link to real users, availability, or clear reporting metrics.
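For illustration, the legacy heuristic boiled down to a modulo over a hard-coded constant. The Kotlin sketch below is a hypothetical reconstruction (the original lived in a spreadsheet formula); it shows why the output was just an index with no link to a real agent, their shift, or their workload.

```kotlin
import java.util.UUID

// Hypothetical reconstruction of the spreadsheet-era heuristic: reduce the delivery UUID
// to a number and take it modulo a hard-coded head count. Nothing ties the result to a
// real agent, their availability, or their current workload.
const val AGENTS_ON_SHIFT = 20 // hypothetical constant

fun legacyAssignedAgentIndex(deliveryUuid: UUID): Int =
    Math.floorMod(deliveryUuid.leastSignificantBits, AGENTS_ON_SHIFT.toLong()).toInt()

fun main() {
    val delivery = UUID.randomUUID()
    println("Delivery $delivery -> agent #${legacyAssignedAgentIndex(delivery)}")
}
```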

We routed the signals through an existing DoorDash workflow tool, enabling operations teams to customize delivery targeting. We also introduced experimentation tooling to validate signal impact and ultimately developed a self-serve agent onboarding process. Figure 5 shows how we improved the integrations, adding capabilities such as experimentation, further curation, and reporting, along with core reusable building blocks such as case assignment that can be leveraged in future phases.

Figure 5: Updated E2E case creation and assignment for Realtime Disaster Tracker includes reusable building blocks for future expansion.

Starting with this simpler, proactive outreach use case created minimal disruption while quickly demonstrating value and establishing a repeatable migration process for future expansions.

Scaling up: Fraud, phone, and chat integration

Next, we integrated complex paths such as fraud investigations and phone/chat support.

  • Fraud: Refund requests requiring manual review overlapped in large part with support case functionality. We collaborated with the fraud team to migrate them onto the new platform, which improved scalability and reduced the risk of cascading failures. 
  • Phone: We leveraged the existing softphone system to integrate inbound calls via APIs, which minimized complexity.
  • Chat: Chat cases used all of the foundational components, adding real-time data visibility and self-serve configuration editing. Comprehensive integration into extract/transform/load pipelines ensured accurate reporting.

Figure 6 illustrates the current architecture of our case management platform, showcasing integrations with various internal and external services and mechanisms for providing relevant data as needed. This centralized architecture promotes interoperability, flexibility, and streamlined workflows for agents and operational teams.

Figure 6: DoorDash’s new centralized case management architecture streamlines processes across the board.

Technical challenges and solutions

During migration, several technical challenges emerged, each addressed through targeted solutions.

  • System interoperability: Operating old and new systems concurrently created complexity in data consistency and source-of-truth management. We used namespace-based hashing (UUID v5) to deterministically map identifiers between the two systems and built functionality to transfer cases across both; a minimal sketch of the hashing appears after this list.
  • Operational excellence through self-serve configuration management: To support operational scaling, we built self-serve configuration tools that empower stakeholders to make changes without engineering assistance. We deliberately deferred this work until a critical mass of agents had migrated, which ensured we delivered tangible business value from the start, avoided wasted effort on potentially unnecessary features, and kept engineering focused on high-impact work.
  • Caching and data freshness: We implemented tailored caching strategies to balance accuracy and performance depending on the use case, described in more detail below.
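Returning to the interoperability point above, the Kotlin sketch below derives a deterministic UUID v5 from a namespace UUID and a legacy identifier. The namespace value and naming scheme here are hypothetical and this is not our production code; the intent is only to show how the same legacy ID always maps to the same identifier in the new system, so records can be correlated across both.

```kotlin
import java.nio.ByteBuffer
import java.security.MessageDigest
import java.util.UUID

// Placeholder namespace (the RFC 4122 DNS namespace); any fixed UUID works, as long as
// both systems agree on it. The actual namespace used internally is not shown here.
val CASE_NAMESPACE: UUID = UUID.fromString("6ba7b810-9dad-11d1-80b4-00c04fd430c8")

/** Derives a deterministic, RFC 4122 version-5 UUID from a namespace and a name. */
fun uuidV5(namespace: UUID, name: String): UUID {
    val sha1 = MessageDigest.getInstance("SHA-1")
    // Hash the namespace UUID (as 16 big-endian bytes) followed by the name bytes.
    sha1.update(
        ByteBuffer.allocate(16)
            .putLong(namespace.mostSignificantBits)
            .putLong(namespace.leastSignificantBits)
            .array()
    )
    sha1.update(name.toByteArray(Charsets.UTF_8))
    val hash = sha1.digest()

    // Keep the first 16 bytes and stamp in the version (5) and variant bits.
    hash[6] = ((hash[6].toInt() and 0x0F) or 0x50).toByte()
    hash[8] = ((hash[8].toInt() and 0x3F) or 0x80).toByte()

    val buf = ByteBuffer.wrap(hash, 0, 16)
    return UUID(buf.long, buf.long)
}

fun main() {
    // The same legacy identifier always produces the same new-system UUID.
    println(uuidV5(CASE_NAMESPACE, "legacy-case-12345"))
}
```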

For case data that required near real-time accuracy, we lazily tagged cache records as stale based on update events published by the CRM, as shown in Figure 7. Rather than eagerly fetching all of the data on each change notification, we lazily fetched it on read if the record was determined to be out of sync.

Figure 7: Cache is tagged out-of-sync when notified of change and lazily fetched on read for out-of-sync records.
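To make the pattern concrete, here is a minimal sketch under simplified assumptions: an in-memory map stands in for the real cache, and the record type and fetch function are hypothetical. A CRM update event only flips a staleness flag; the expensive refetch happens lazily on the next read.

```kotlin
import java.util.concurrent.ConcurrentHashMap

// Hypothetical, simplified types standing in for the real case record and CRM client.
data class CaseRecord(val id: String, val payload: String, val stale: Boolean = false)

class CaseCache(private val fetchFromCrm: (String) -> CaseRecord) {
    private val cache = ConcurrentHashMap<String, CaseRecord>()

    /** Called when the CRM publishes an update event: only tag the record as stale. */
    fun onCrmUpdateEvent(caseId: String) {
        cache.computeIfPresent(caseId) { _, record -> record.copy(stale = true) }
    }

    /** Called on read: refetch lazily only if the record is missing or out of sync. */
    fun get(caseId: String): CaseRecord {
        val cached = cache[caseId]
        if (cached != null && !cached.stale) return cached
        val fresh = fetchFromCrm(caseId) // expensive call, made only when needed
        cache[caseId] = fresh
        return fresh
    }
}
```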

Figure 8 shows the different cache-busting techniques we employ, which allow the system to fully refresh cached data when guaranteed accuracy is required and to recover from any missed events.

Figure 8: Two cache-busting paths allow the system to self-heal for guaranteed accuracy and recover from missed events.

For less time-sensitive data, such as agent information, we opted for eventual consistency, using daily jobs to refresh caches from our data warehouse as shown in Figure 9. This allows us to process data at scale, handling only data that is no longer accurate.

Figure 9: Eventually consistent cache refreshes data only as required. 
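A minimal sketch of that reconciliation, assuming a daily job that compares the warehouse snapshot of agent data against the cache and rewrites only entries that have drifted (the types and fields are hypothetical):

```kotlin
// Hypothetical shapes for cached agent data and the nightly warehouse snapshot.
data class AgentProfile(val agentId: String, val team: String, val skills: Set<String>)

/**
 * Daily reconciliation: compare the warehouse snapshot against the cache and
 * rewrite only the entries that are missing or no longer accurate.
 */
fun reconcileAgentCache(
    warehouseSnapshot: List<AgentProfile>,
    cache: MutableMap<String, AgentProfile>,
) {
    var refreshed = 0
    for (profile in warehouseSnapshot) {
        if (cache[profile.agentId] != profile) {
            cache[profile.agentId] = profile
            refreshed++
        }
    }
    println("Refreshed $refreshed of ${warehouseSnapshot.size} agent records")
}
```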

Efficient case assignment to agents

We adopted a micro-batch assignment strategy, synchronously updating cached agent availability and active case counts during assignment to avoid overload. We asynchronously recomputed these counts using a change data capture feed as cases resolved. Incremental complexity, such as multi-queue assignments, was introduced only when needed.
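The sketch below shows the synchronous portion of that flow with hypothetical types and a hypothetical per-agent cap; the real system also accounts for queues, skills, and the change data capture feed that recomputes counts as cases resolve.

```kotlin
// Hypothetical in-memory view of agent state used during a micro-batch assignment pass.
data class AgentState(val agentId: String, var available: Boolean, var activeCases: Int)

const val MAX_ACTIVE_CASES = 3 // hypothetical per-agent cap

/**
 * Assign a small batch of pending case IDs, updating cached availability and
 * active-case counts synchronously so no agent is overloaded within the batch.
 */
fun assignMicroBatch(pendingCaseIds: List<String>, agents: List<AgentState>): Map<String, String> {
    val assignments = mutableMapOf<String, String>() // caseId -> agentId
    for (caseId in pendingCaseIds) {
        val agent = agents
            .filter { it.available && it.activeCases < MAX_ACTIVE_CASES }
            .minByOrNull { it.activeCases } ?: break // no capacity left in this batch
        agent.activeCases++                          // synchronous count update
        if (agent.activeCases >= MAX_ACTIVE_CASES) agent.available = false
        assignments[caseId] = agent.agentId
    }
    return assignments
}
```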


Case popup after assignment

To enable quick responses to customers and guard against idle time for agents, cases must appear almost instantly in the agent’s browser. We deployed several measures to minimize delays:

  • Short polling enabled near real-time updates.
  • Optimized indexing and caching improved lookup times.
  • Fallback mechanisms efficiently invalidated caches.

We explored push-based solutions but ultimately found polling sufficient, avoiding extra complexity. To handle agents with multiple browser tabs open, we also introduced per-user rate limiting and endpoint caching, preventing conflicts and race conditions across sessions.
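One way the per-user rate limiting and endpoint caching could fit together is sketched below, with a hypothetical time window standing in for the real tuning; multiple tabs polling for the same agent reuse one cached response instead of racing the backend.

```kotlin
import java.util.concurrent.ConcurrentHashMap

// Hypothetical tuning value: at most one backend lookup per agent within this window,
// which serves as both the per-user rate limit and the endpoint cache TTL.
const val POLL_CACHE_TTL_MS = 2_000L

data class PopupResponse(val caseIds: List<String>, val fetchedAtMs: Long)

class PopupPollEndpoint(private val loadAssignedCases: (String) -> List<String>) {
    private val cached = ConcurrentHashMap<String, PopupResponse>()

    /** Handle a poll from any of an agent's open browser tabs. */
    fun poll(agentId: String, nowMs: Long = System.currentTimeMillis()): PopupResponse =
        cached.compute(agentId) { _, existing ->
            if (existing != null && nowMs - existing.fetchedAtMs < POLL_CACHE_TTL_MS) {
                existing // another tab already refreshed recently; reuse its response
            } else {
                PopupResponse(loadAssignedCases(agentId), nowMs) // single refresh per window
            }
        }!!
}
```

Because ConcurrentHashMap.compute is atomic per key, concurrent polls from different tabs cannot trigger duplicate backend lookups within the same window.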

Figure 10 shows the various components and interactions required to successfully connect customers with agents during chat cases.

Figure 10: Chat case popup logic in the CRM-caching phase connects customers with agents.

Flexible schema management

To accommodate evolving data without constant code changes, we leveraged JSONB storage with a generalized inverted index for dynamic fields, using Protobuf Structs at the API level for structured flexibility. Core fields were standardized in code, with configurable metadata managing queue-specific schemas.
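As an illustration of the storage side, the sketch below uses hypothetical table and column names: core case fields live in typed columns, while queue-specific fields land in a JSONB column indexed with a generalized inverted (GIN) index, which keeps lookups on dynamic fields fast. The Protobuf Struct handling at the API layer is omitted.

```kotlin
import java.sql.DriverManager

// Hypothetical schema: typed columns for core fields, JSONB plus a GIN index for dynamic fields.
const val DDL = """
    CREATE TABLE IF NOT EXISTS cases (
        id UUID PRIMARY KEY,
        queue_name TEXT NOT NULL,
        status TEXT NOT NULL,
        dynamic_fields JSONB NOT NULL DEFAULT '{}'::jsonb
    );
    CREATE INDEX IF NOT EXISTS cases_dynamic_fields_gin
        ON cases USING GIN (dynamic_fields);
"""

fun main() {
    DriverManager.getConnection("jdbc:postgresql://localhost/support").use { conn ->
        conn.createStatement().use { it.execute(DDL) }
        // Containment query: find cases whose dynamic fields match a key/value pair.
        conn.prepareStatement(
            "SELECT id FROM cases WHERE dynamic_fields @> ?::jsonb"
        ).use { ps ->
            ps.setString(1, """{"escalation_reason": "missing_item"}""")
            ps.executeQuery().use { rs ->
                while (rs.next()) println(rs.getObject("id"))
            }
        }
    }
}
```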

Key results and impact

Elimination of workarounds

Standardized processes allowed teams to eliminate inefficient practices such as copying IDs or URLs and manually searching cases. Our close alignment with DoorDash’s core technology stack enabled efficient data retrieval through referencing identifiers rather than duplicating data.

Improved metrics and performance

We closely monitor metrics at DoorDash to track progress toward our objectives and key results. The primary success metric for this project was the number of users migrated to the new system. Beyond simple adoption, the platform enabled faster iteration by allowing multiple outbound case path experiments and validating their impact in real time.

To ensure the transition maintained or improved operational efficiency, we tracked several key metrics:

  • Average handle time (AHT) – time taken to resolve cases
  • Manual tasks per resolution (MTR) – effort required for case resolution
  • Did we resolve (DWR) – resolution effectiveness
  • First contact resolution (FCR) – percentage of cases resolved without follow-up

These metrics remained stable or improved post-migration. Additionally, qualitative feedback from agents highlighted a significant speed improvement in case popups, making workflows more seamless and efficient.

Standardized tech stack and enhanced scalability

A major win on the technology side is that we are now fully integrated into the standard DoorDash tech stack. This provides several key benefits:

  • Faster data availability: Data is now available in the data lake within roughly 15 minutes, compared to the previous daily processing cycle.
  • Optimized data storage: We have improved efficiency by storing identifiers for entities instead of duplicating raw data.
  • Greater engineering support: More engineers can contribute to updates because the system now leverages standard infrastructure, coding languages, and GitHub repositories, positioning us for better scalability and long-term maintainability. 
  • Improved system isolation: We have eliminated some risks of cascading failures that existed in the old internal system and now have direct control over the infrastructure.

With the new platform, we can also take advantage of DoorDash’s standardized metrics stack, including real-time alerting, as well as use our own deployment schedules and rollbacks, rather than relying on third-party vendors.

Another key advantage is improved cost forecasting. With the new system, there is less risk of unpredictable cost increases during contract renewals, giving DoorDash greater financial stability and control over expenses while reducing vendor lock-in.

Conclusion

Our journey building DoorDash’s in-house case management platform highlights the power of incremental, iterative improvements. By initially focusing on agent inefficiencies and fragmented tools, we systematically delivered impactful enhancements, reduced operational costs, and improved user experiences.

The insights from this project extend beyond case management, offering broader lessons on balancing innovation with practicality, maintaining stakeholder alignment, and managing complex migrations incrementally to minimize risk.

Looking ahead, this robust foundation enables deeper automation, advanced AI integrations, and seamless interoperability, empowering DoorDash’s support operations to scale efficiently with ongoing company growth. We hope sharing our experiences inspires and informs similar transformative efforts.

Acknowledgements

There were numerous team members and contributors to the project over its lifecycle. We would like to give special acknowledgements to: Derrick Hu, Han Huang, Kumaril Dave, Vivek Parekh, Cameron Javier, Vikas Jindal, Matt Diana, Manas Narayanappa, Kishore Guruswamy, Abhinav Mummidi, Nelly Daniel, and Jenny Yang.

About the Author

  • Dan Behar is a Software Engineer on the Support Platform team at DoorDash. His focus is on building software that drives business impact and supports scalable operations. He’s also a proud father of two.
