Skip to content

Blog


Scaling a routing algorithm using multithreading and ruin-and-recreate

November 30, 2021

|

Ben Katz

At DoorDash, route optimization is a key component of our dispatch system, known internally as DeepRed. The routing algorithm design we chose — based on the ruin-and-recreate principle —  helped us to scale to meet the increased demand caused by COVID relief efforts and the subsequent increased API requests from major partners, including Walmart, Macy’s, and United Way. First we will delve into the vehicle routing problem and how DoorDash’s Drive team managed to automate our routing algorithm to meet increased demand.

Routing algorithm design with the ruin-and-recreate principle

Route optimization is a computationally intensive process and a commonly known NP-hard problem. As such, attempting to optimize a complex route containing a number of distinct locations scales poorly if every location must be optimized synchronously. Synchronous optimization means that a large number of locations would take hours, or even days, to complete, which won’t work when route optimization must be completed in seconds. 

Solving this problem for last-mile logistics introduces additional complexity because a multitude of orders must be delivered same-day under a range of constraints, including varying delivery windows, vehicle capacity, speed, and Dasher, our name for delivery drivers, availability. For this scenario, a route’s quality must be balanced against the execution time required to optimize the route — all while keeping spending on API calls to a minimum.

To address these concerns, the algorithm we wrote relies on the ruin-and-recreate principle, which is a methodology that allows us to solve an optimization problem against a set of constraints. In the context of a routing algorithm, the ruin-and-recreate principle can be applied as follows:

  • Input is given as a list of jobs to optimize. Think of these “jobs” as geographical locations on a route along with requested actions at each location, like “pick up” or “drop off.” We also need to consider a candidate list of drivers who can potentially fulfill each job.
  • A randomness factor is used to generate an initial solution involving all route points; usually, this is not the optimal solution.
  • In the subsequent "ruin" step, the randomized solution is split into two components. The first component is a set of jobs that a driver cannot serve, given constraints like time or vehicle speed. The second component is a partial solution containing all jobs that can be handled successfully by a driver.
  • The "recreate" step combines the list of "not served" jobs with the partial solution containing “served” jobs.
  • Ultimately, a combined solution is generated and scored. If it exceeds the quality of previous solutions, it is now considered the best solution.
  • These steps are repeated until termination criteria (in other words, a condition used to know when to stop running the algorithm) are met, for example a time limit or a maximum number of iterations. The final output is a list of solutions with those scoring highest at the top. For context this algorithm’s output is then used to offer Dashers deliveries in our dispatch service .

While the basic ruin-and-recreate approach creates an algorithm that generates a list of optimized routes, it would work too slowly and offers no safeguards against failure if run synchronously against a large input size — say, more than 1,000 stops. 

A trivial greedy route optimization algorithm — one in which the driver’s next closest destination is always chosen, regardless of the route’s total cost — can be written as a fallback against failures — for instance, if the ruin-and-recreate algorithm encounters exceptions or timeouts. A greedy route optimization algorithm also can serve as a useful benchmark solution to ensure that the ruin-and-recreate solution is, indeed, the best possible quality.

To improve execution time for large input sizes, it can be useful to apply techniques such as multithreading that split the problem into chunks of jobs that can be optimized concurrently. Before we go into more detail on how we adapted this algorithm to DoorDash’s logistics engine, we’ll first explain why DoorDash chose to focus on improving route optimization.

DoorDash large route optimization: 2018-2019

As DoorDash pushed to grow the business in 2018 and 2019, it was constrained by a dispatch system that could only support two orders per Dasher at any given time. Such a limitation was acceptable for restaurant orders requiring tight time limits on delivery. But it was a deal breaker for retail partners using DoorDash Drive, our white-label fulfillment platform which is utilized by a number of merchants who want to offer delivery, via the dasher network, through their own order platforms. For them, store operations could be overwhelmed by too many Dashers showing up simultaneously to pick up orders. In turn, Dashers would experience long waits to fetch orders, ultimately delaying deliveries to end customers and reducing the active time Dashers could be spending earning on deliveries. 

Our small operations team initially resorted to a stop-gap solution that we knew would need to be replaced with a scalable long-term solution. Each day we would compile and upload all orders that needed to be fulfilled. Our internal system would then make synchronous API calls to create deliveries and manually assign a Dasher to each order. The routing was all manual, with ops specifying a numerical order for each delivery. This synchronous solution meant we could handle only about 100 orders at a time. On top of that, time and resourcing constraints meant that the API used to create deliveries was separate from the public-facing API; insufficient validation caused confusion for team members who used incorrect file formats. Such a tool couldn’t handle any significant request volume and created maintenance overhead because of the separate delivery creation API:

Figure 1: A simplified view of the portion of the system that handled large routes as it existed in 2018-2019.

DoorDash large route optimization: Enter COVID-19

When COVID-19 was declared a pandemic in March 2020, DoorDash experienced an unprecedented spike in demand. One way this impacted DoorDash was through our fulfillment of school meals.  Previously, underprivileged students enjoyed reduced-price and free meals at school. With shelter in place orders enacted, these students lacked access to this subsidized nutrition. Larger public school systems and nonprofit organizations reached out to DoorDash for help efficiently delivering these meals from the schools to local neighborhoods.

How we automated our route optimization algorithm

Fortunately, DoorDash’s team was able to quickly replace its manual, ops-driven solution with a more automated one that could handle this increased demand. To meet the sudden spike in pandemic-related demand, we had to solve the overall route optimization problem by designing and building a number of different components . 

The first component needed to be able to ingest hundreds of thousands of orders daily in a matter of minutes. While most orders could simply be supplied via the merchant via API request, some partners were low-tech and lacked the ability to call our API, requiring us to build a new CSV uploader for them to use.

After orders were successfully entered into our system, we needed a second component to ensure routing was done efficiently, inexpensively, and in a way that   prevented too many Dashers showing up at the same location simultaneously.

Stay Informed with Weekly Updates

Subscribe to our Engineering blog to get regular updates on all the coolest projects our team is working on

To provide that first component for low-tech merchants, we used Amazon S3 to store large CSV files, an automated worker process to grab the files from S3, and applied multithreading to call our public-facing create-delivery API to ensure orders were ready for fulfillment. That API already had validation built in, making failures much more visible to any team member who uploaded a file.Using this approach, the initial upload of orders for low-tech partners remained partially manual, but the rest of the system had been fully automated. This meant that operations would first upload a CSV containing a merchant’s daily deliveries — no more restrictions on how many deliveries could appear in each file — then a worker would chunk the input into equal parts and process all parts asynchronously using RxJava. At the time, Kotlin coroutines, syntactic sugar used to manage multitasking and allowing tasks to be suspended and resumed automatically, were new and untested, so we went with the more established pattern of applying RxJava. Inside the processing step, we’d make a call to our create-delivery API to get the order queued for fulfillment. A final worker process in our fulfillment system would make an API call to a third-party route optimization API to generate optimized routes containing all of the deliveries supplied in the first step. These optimization and assignment components are depicted in Figure 2 below:

Figure 2: A simplified diagram of the improved large route ingestion pipeline, built at the beginning of COVID-19.

DoorDash large route optimization: Further improvements in 2021

While the system described above initially worked well for pandemic relief, we encountered pain points with the third-party API used for route optimization, including:

  • No control over making changes to the routing algorithm without asking the third-party team to make fixes; often we could see that a suboptimal route was generated, but we had no visibility into why.
  • Expensive billing practices meant that we couldn’t easily scale this process to the entire set of DoorDash deliveries because API request costs would spiral out of control
  • The rest of the company was not using this service for route optimization, so Dasher experience differed significantly across Drive and Marketplace; bug fixes applied to the rest of the system often were not backported to the above Drive system.

Expanding upon our 3rd party routing 

Prompted by a request from one of our biggest partners to apply large routing at all of its stores, we worked in late 2020 and early 2021 to improve DeepRed’s routing algorithm to support a large number of orders and deprecate the third-party routing algorithm mentioned in Figure 2 above. We built our own routing algorithm for large order sizes using the ruin-and-recreate approach discussed earlier. Instead of building from scratch, we found an open source library that used the same ruin-and-recreate approach for route optimization and modified parts of it to fit DoorDash’s needs. Using some of our proprietary ML models, we built custom logic to estimate travel duration and discern which stops should be made.  In the end, our system design included no  third-party dependencies for route optimization.

We initially attempted synchronous route optimization. But when we ran this against a benchmark of 10,000 deliveries, we observed the execution time to be on the order of minutes — far too slow for our needs. To support DoorDash’s scale, DeepRed’s route optimization code runs in a sharded manner; we associate specific geographical regions to a shard and then optimize only within that shard while using Scala Futures to ensure processing is done asynchronously. This asynchronous approach, which behind the scenes employs a thread pool to manage simultaneous tasks, allows us to reduce execution time from several minutes to an average of a few seconds. The following is the simplified Scala code that runs inside each DeepRed node to generate a list of optimized routes for each Dasher-deliveries pair:

...
val futures = for {
 (deliveries, dashers) <- candidates
 dasher <- dashers
} yield {
 Future {
   runOptimizationAlgorithm(state, maxIterations, maxRoutesPerShiftCandidate, algorithmTimeoutTermination, numRoutesToLog, deliveries, dasher)
 }
}

val futureSequence = Future.sequence(futures)
val optimizedRoutes = try {
 recordTimeTaken(...) {
   Await.result(futureSequence, totalTimeoutDuration)
 }
} finally {
 ...
}

shiftCandidatesWithRoutes.flatten.toMap // Return results for further filtering, scoring.

The abbreviated system diagram in Figure 3 shows how our delivery creation process interacts with sharded Deep Red nodes.

Figure 3: The above is the current approach used to fulfill large routes with DeepRed. DeepRed executes in a sharded manner to make routing decisions, then a separate API call is made to assign dashers to outstanding orders.

While there are several benefits to the system depicted in figure 3, which uses geographical sharding and a custom route optimization algorithm, there are also a few drawbacks.

Pros:

  • In-house algorithm results in extremely low costs
  • Improved latency compared to third-party service
  • Better integration with overall DoorDash system; adherence to common interfaces yield benefits when DeepRed bugs are fixed upstream
  • Open source library provides ultimate flexibility to change the algorithm to produce different kinds of routes

Cons:

  • While the algorithm generally produces high-quality routes, timeouts do occur because of a hard time constraint required by multithreading and the intentionally limited threadpool. In the event of a timeout, the greedy algorithm results — which generally are suboptimal — are used as a fallback
  • This approach won’t scale to support our future vision and will need to be updated in due time


A future vision for large route optimization

Although this approach is working well for now, future demand on the routing product will lead us in several new directions: 

  • Allow low-tech merchants to upload orders themselves. The goal is to use a merchant-facing portal instead of requiring DoorDash’s operations team to perform manual uploads on their behalf.
  • Support 100 or more orders for a single Dasher. Although we support tens of orders per Dasher currently, we imagine a world where a Dasher can pick up hundreds of orders to deliver across a city over the course of a day. To support this, we must run route optimization offline to better isolate this from the rest of our assignment system. We’re also considering further parallelization by applying a message queue to ensure chunks of deliveries to optimize are handled by separate containers instead of just separate threads.
  • Support large items with different vehicle types. Dashers primarily drive sedans today, which limits how many orders can be picked up. We’re considering opening up new opportunities for Dashers with trucks, SUVs, or vans to deliver furniture, appliances, and other oversized products. Larger vehicles could also be used as extra capacity for the 100+ orders use case.

Conclusion

By using the ruin-and-recreate principle in combination with a greedy routing algorithm while applying multithreading, we can achieve best-in-class route optimization and, consequently, an ideal delivery experience without sacrificing execution time.

If these kinds of problems excite you, we’d love you to come work with us as we build the next-generation version of our logistics engine!

About the Author

  • Ben Katz

Related Jobs

Location
San Francisco, CA; Mountain View, CA; New York, NY; Seattle, WA
Department
Engineering
Location
San Francisco, CA; Sunnyvale, CA
Department
Engineering
Location
San Francisco, CA; Sunnyvale, CA; Seattle, WA
Department
Engineering
Location
Pune, India
Department
Engineering
Location
San Francisco, CA; Seattle, WA; Sunnyvale, CA
Department
Engineering