Leveraging Flink to Detect User Sessions and Engage DoorDash Consumers with Real-Time Notifications

November 7, 2023

Chen Yang

Fan Zhang

At DoorDash, we value every chance to boost order conversions in the app. When users fail to complete a purchase after adding items to their carts, we send push notifications such as the one shown in Figure 1 to remind them that their orders are still pending. It has been difficult, however, to determine whether users have actually abandoned their carts or are simply browsing for more items or different merchants within the app.

Figure 1: An example of a real-time push notification to prompt purchase completion.

To polish the notification experience, we want to ensure that cart abandonment notifications are sent in a timely manner - but only when users truly stop adding to their carts and abandon the app. 

To do this, we built a large stateful Flink job to track user sessions using mobile/web client events. By analyzing user activity and detecting periods of inactivity in short time frames, we now can send notifications at more opportune moments. This new solution has yielded significant conversion improvements. We delve here into the different options we considered, why we believe our new design is most efficient, and what we have learned as we implemented this solution.

Deciding whether to trigger in the frontend or backend

Backend events typically are generated by internal services, while frontend events are created by the mobile app or web. If our goal is to send notifications to users who update their cart and then stay idle for X minutes, how do we let the system know when to trigger the workflow?

Cart abandonment V1

Our old design required constantly checking the cart update status by calling the backend cart service, kicking off the send workflow when there were no updates. 

Figure 2: Cart abandonment V1 workflow.

As shown in Figure 2, this design does not reflect the actual status of user activities. It blindly polls the backend services at a fixed interval and sends potentially unnecessary notifications to the user, reminiscent of my daughter asking the same question over and over when we're driving somewhere.

Cart abandonment V2

In our new design, however, real-time session detection offers a smarter solution that eliminates redundant calls, as shown in Figure 3:

Figure 3: Cart abandonment V2 workflow.

In this design, all user activities in the mobile and web applications generate analytics events. The real-time session detection job continuously receives all frontend events and groups them into sessions. This makes it possible to detect when a session ends because of user inactivity and to send a signal to the notification service to trigger the workflow.
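The gap-based grouping described above can be illustrated with a minimal sketch in plain Python. This is not the Flink job itself; the `Session` and `SessionDetector` names, the event fields, and the timer callback are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Session:
    user_id: str
    start: float  # timestamp of the first event, in seconds
    end: float    # timestamp of the latest event, in seconds
    events: List[str] = field(default_factory=list)


class SessionDetector:
    """Groups per-user events into sessions separated by an inactivity gap."""

    def __init__(self, gap_seconds: float) -> None:
        self.gap_seconds = gap_seconds
        self.open_sessions: Dict[str, Session] = {}

    def on_event(self, user_id: str, name: str, ts: float) -> List[Session]:
        """Record an event and return any session that it closed."""
        closed: List[Session] = []
        current = self.open_sessions.get(user_id)
        # An event arriving after the gap ends the previous session.
        if current is not None and ts - current.end > self.gap_seconds:
            closed.append(self.open_sessions.pop(user_id))
            current = None
        if current is None:
            current = Session(user_id, start=ts, end=ts)
            self.open_sessions[user_id] = current
        current.end = max(current.end, ts)
        current.events.append(name)
        return closed

    def on_timer(self, now: float) -> List[Session]:
        """Close every session that has been idle for longer than the gap."""
        expired = [
            user_id
            for user_id, session in self.open_sessions.items()
            if now - session.end > self.gap_seconds
        ]
        return [self.open_sessions.pop(user_id) for user_id in expired]
```

In this sketch, each session returned by `on_timer` plays the role of the signal handed to the notification service. A periodic timer is needed because an idle user, by definition, produces no further events to close the session.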

Trigger: Real-time session detection

When customers use DoorDash's app and website, a number of analytics events are generated that yield information about a user's experience and activities. Detecting these user sessions can give the downstream notification services insights about user activities, including start time and end time. This is an opportunity to develop an intelligent notification system that can send context-aware prompts at the appropriate time with relevant content.  

Sessionization traditionally is done as a batch process at DoorDash. But there are huge computational costs and latency involved in joining multiple events and detecting gaps, caused largely by having to load massive amounts of data from cold storage to memory prior to processing. Additionally, the hours required to complete a batch process are incompatible with real-time session detection. 

The real-time platform team leverages our streaming platform to create sessions by grouping and identifying various mobile and web events, as shown in Figure 4. We developed a sessionization platform that lets application teams easily define and evolve sessions with different input events and session gaps. Because stream processing is centered on in-memory computation, it eliminates the need to load data from cold storage; events are processed in memory as soon as they arrive over the network, reducing both latency and the cost of data loading. With the sessionization job, we can detect in real time when customers finish a session, allowing the application to act on these signals.

Figure 4: Real-time session detection data flow.

There are, however, a few challenges that need to be addressed when using real-time session detection. 

  • Infrastructure for large-state computing: To build sessions from input events, Flink needs to keep all of a user's session events until the session ends. This in-flight data is treated as state in Flink and is managed in the RocksDB state backend. With average DoorDash user sessions lasting around an hour, we need to maintain an hour's worth of user activity for each customer as state. That means the state amounts to hundreds of gigabytes at any moment. The RocksDB backend could not run on our existing infrastructure, which shares local storage across jobs. We worked with DoorDash's infrastructure team to support this kind of stateful computation with a guaranteed persistent volume for each Flink task manager.

  • Job failure and recovery: Failure recovery is handled at Flink's framework level by restarting failed tasks from the last checkpoint. Checkpoints are persisted in S3 to make them durable and scalable. The Flink job manager is set up in high availability mode with two job managers, one as leader and the other on standby. If the leader job manager fails, the standby immediately assumes the lead role and resumes the job from the job manager state stored separately in S3.

Send: Cart abandonment V2 notification workflow

After the cart abandonment workflow is triggered, we will not send notifications to users if:

  • The cart is no longer available, for example, if the user already has checked out.
  • Minimum requirements for sending aren't met, such as missing critical fields.
  • There are other reasons that would make a notification inefficient.

Notifications will reach consumers through the notification platform only if all eligibility checks have passed, ensuring that these messages are timely and meaningful.
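As a rough illustration of the "all eligibility checks must pass" rule, the checks can be modeled as a chain of small functions. This is a minimal sketch: the signal fields (`checked_out`, `consumer_id`, `store_name`), the function names, and the reason strings are hypothetical, not the actual notification service code.

```python
from typing import Callable, Dict, List, Optional

Signal = Dict[str, object]
# A check returns a skip reason, or None when the signal passes.
Check = Callable[[Signal], Optional[str]]


def cart_still_available(signal: Signal) -> Optional[str]:
    # Assumed field: the cart is unavailable once the user has checked out.
    return "cart_unavailable" if signal.get("checked_out") else None


def has_critical_fields(signal: Signal) -> Optional[str]:
    # Assumed critical fields for this sketch.
    missing = [f for f in ("consumer_id", "store_name") if not signal.get(f)]
    return "missing_fields:" + ",".join(missing) if missing else None


CHECKS: List[Check] = [cart_still_available, has_critical_fields]


def first_failure(signal: Signal, checks: List[Check] = CHECKS) -> Optional[str]:
    """Return the first skip reason, or None if every check passed."""
    for check in checks:
        reason = check(signal)
        if reason is not None:
            return reason
    return None
```

A notification would be sent only when `first_failure` returns `None`. Returning a reason instead of a bare boolean makes it easier to track why a signal was skipped.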

Result: Outperform other notifications

The new design for cart abandonment drives significantly more order volume and revenue. During our experiment, there was a six-fold increase in notification send speeds. We believe that because triggering is more accurate and timely, notifications are more relevant to consumers. As a result, consumers engaged with notifications more, with a 40% higher send-to-open rate and a 2.6% boost in visit rates over the course of the experiment.

Future Work: Cart abandonment notifications with promotions and more

Our success with redesigning cart abandonment notifications has led to a new series of experiments, including:

  • Adding promotions to cart abandonment notifications when there are eligible items in carts, encouraging consumers to continue checkout at a cheaper price.

  • Controlling the frequency of cart abandonment notifications. By default, we now send no more than one notification to a user per day. But through a sophisticated new analytics effort, we are experimenting with improving visibility by selecting target consumers to receive more than one per day.
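The frequency rule in the last item can be sketched as a per-consumer daily cap with optional overrides. This is only an illustration: the `DailyFrequencyCap` class, its in-memory counter, and the override mechanism are assumptions, not a description of the production system.

```python
from collections import defaultdict
from datetime import date
from typing import DefaultDict, Dict, Tuple


class DailyFrequencyCap:
    """Limits how many notifications a consumer receives per calendar day."""

    def __init__(self, default_limit: int = 1) -> None:
        self.default_limit = default_limit
        # Hypothetical per-consumer overrides, e.g. for an experiment group.
        self.overrides: Dict[str, int] = {}
        self.sent: DefaultDict[Tuple[str, date], int] = defaultdict(int)

    def set_limit(self, consumer_id: str, limit: int) -> None:
        self.overrides[consumer_id] = limit

    def try_send(self, consumer_id: str, day: date) -> bool:
        """Return True and count the send if the consumer is under the cap."""
        limit = self.overrides.get(consumer_id, self.default_limit)
        key = (consumer_id, day)
        if self.sent[key] >= limit:
            return False
        self.sent[key] += 1
        return True
```

With the default limit of one, a second `try_send` for the same consumer on the same day returns `False`; calling `set_limit` for selected consumers raises their cap without affecting anyone else.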

Figure 5: Real-time event sending workflow.

Because we own frontend events, adding other notifications such as merchant abandonment notifications seems to be a natural evolution. With this system in place, such an expansion should be simple because the pipeline already is in place to manage notifications associated with consumer behaviors. In fact, each new notification is only two steps away: 

  1. Define a real-time event. The sessionization platform can easily define and evolve sessions with different input events, session gaps, and session outputs. The real-time infrastructure team can build automation and UI for developers to easily onboard new real-time signals. That team also is exploring options to enable developers to define custom logic to process a session event - for example, filtering against clients who viewed merchants but didn't place any order. 

  2. Combine real-time events with the sending workflow. With the necessary validations in place, the sending pipeline skips signals that are missing critical fields, such as store name and consumer ID. Because the sending pipeline is completely configuration-driven, onboarding a new notification is a matter of adding configuration rather than changing code, which makes the pipeline more flexible.
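The two steps above might look roughly like the following configuration-driven sketch. Everything here is hypothetical: the `NotificationConfig` fields, the gap values, the message templates, and the `build_notification` helper are placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class NotificationConfig:
    signal_name: str                  # step 1: the real-time event to listen for
    session_gap_minutes: int          # step 1: inactivity gap (placeholder values)
    required_fields: Tuple[str, ...]  # step 2: validation before sending
    template: str                     # step 2: message to render


# Adding a notification means adding an entry here, not writing new code.
CONFIGS: Dict[str, NotificationConfig] = {
    "cart_abandonment": NotificationConfig(
        signal_name="cart_abandonment",
        session_gap_minutes=30,
        required_fields=("consumer_id", "store_name"),
        template="Your order from {store_name} is still in your cart.",
    ),
    "merchant_abandonment": NotificationConfig(
        signal_name="merchant_abandonment",
        session_gap_minutes=60,
        required_fields=("consumer_id", "store_name"),
        template="Still thinking about {store_name}?",
    ),
}


def build_notification(signal_name: str, payload: Dict[str, Any]) -> Optional[Dict[str, str]]:
    """Render a notification, or return None if the signal must be skipped."""
    config = CONFIGS.get(signal_name)
    if config is None:
        return None
    # Skip signals that are missing any critical field.
    if any(not payload.get(name) for name in config.required_fields):
        return None
    return {
        "consumer_id": str(payload["consumer_id"]),
        "message": config.template.format(**payload),
    }
```

The point of the design is that `build_notification` never changes when a new notification type is onboarded; only `CONFIGS` grows.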

Advice for working with frontend events

By allowing new approaches that can't be done by backend events alone, frontend events play a critical role in cart abandonment use cases. Before you adopt frontend events in your next design, keep in mind:

  • Frontend data from mobile and web activities is not always reliable or stable. Data could be corrupted or important fields may be missing. Verify data quality before proceeding.

  • Various glitches may cause unexpected delays. During our experiment, some frontend events reported delays exceeding 24 hours. In some cases, delays were caused when the application was terminated before event syncing to services was completed; delayed events then weren't reported until the application reopened. An application should take possible delays into account and have methods for handling them.

  • Make input more reliable by checking in regularly with the frontend team to resolve any issues quickly.

  • Use both frontend and backend data. Backend calls may be able to answer questions that the frontend can't.  In the cart abandonment workflow, fields such as country code and store open status can be fetched from backend services. There is invaluable synergy between frontend and backend data. It not only enables more accurate decision-making, but also empowers an organization to deliver personalized and contextually relevant experiences to users.
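The first two points above, verifying data quality and accounting for delays, can be combined into one guard in front of any consumer of frontend events. The sketch below is an assumption-laden example: the `FrontendEvent` fields, the `classify_event` function, and the delay threshold are illustrative choices rather than DoorDash's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FrontendEvent:
    consumer_id: Optional[str]
    name: Optional[str]
    event_time: float     # when the action happened on the device (epoch seconds)
    received_time: float  # when the event reached the backend (epoch seconds)


def classify_event(event: FrontendEvent, max_delay_seconds: float) -> str:
    """Return "invalid", "late", or "ok" for a frontend event."""
    # Data quality: reject events that are missing critical fields.
    if not event.consumer_id or not event.name:
        return "invalid"
    # Delay handling: an event synced long after it happened should not
    # trigger a notification as if the activity were recent.
    if event.received_time - event.event_time > max_delay_seconds:
        return "late"
    return "ok"
```

Late events do not have to be dropped; routing them to a separate path for offline analysis is one way to keep the data without triggering stale notifications.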

Conclusion

We are thrilled with the results of this project. It not only reduces the likelihood of customers leaving essential items in their carts, but also enhances DoorDash order processing efficiency, improving overall user satisfaction. Leveraging real-time data opens a pipeline to a multitude of future possibilities. For example, the app could suggest related stores or items based on the store views and search history in a session. We eagerly anticipate running additional experiments and taking further strides toward elevating our app's user experience.

Acknowledgements

Applause to team members who directly contributed to the project: Mengyang Cui, Eleanore Jin, Julie Xue, Sonic Wang, Allen Wang, Yanlin Peng, Nicole Lin, James Smith, Eric Bencomo Dixon, and Kristine Fernandez.

Special thanks to the following team members who closely collaborated with us: Karthik Thota, Michael Adaikalaraj, Xavier Hodges, Praveen Goparaju, Mike Pitre, Jeremy Pian, Chun-Wei Lee, Almer Bacallan, Abhijeet Bane, Keith Lehmann, Lisa Shang, Bronson Mirafuentes, Jason Lai, and Yun Huang.

About the Authors

  • Chen Yang

    Chen Yang is a software engineer on the Real-Time Streaming Platform team at DoorDash. His focus is on building the streaming processing platform and real-time related products.

  • Fan Zhang

    Fan Zhang is a software engineer on both the Growth and PAD teams under Marketplace at DoorDash. His focus is on building notification sending pipelines and other products such as the notification hub.
