Ideation and Requirements: The prediction service’s role in machine learning (ML) infrastructure
Before starting any actual coding, we took some time to consider exactly what role Sibyl would play in DoorDash’s ML infrastructure ecosystem, as well as all of the functionality this new prediction service would need. In terms of its role, we wanted the Sibyl prediction service to handle all real-time predictions and to focus solely on predicting, leaving other components, such as feature calculation, model training pipelines, and storage of features and models, to separate services and stores. The first things we considered were scalability and latency: we were expecting hundreds of thousands of predictions per second, and in order to convince other services to call our service for predictions, we needed to make them so fast that calling Sibyl would beat the alternative of each individual service making predictions itself.

Figure 1 shows how the Sibyl prediction service fits into the ML infrastructure ecosystem: all predictions come in from other services as gRPC requests, and Sibyl retrieves both models and features from independent stores (with the option of outputting predictions to Snowflake for offline evaluation). A possible future expansion is a separate model evaluator for the pure prediction computation needed by complex models; for V1, however, this is included within Sibyl.

In terms of required functionality, here were some highlights:

- Batch predictions: Each prediction request may contain a variable number of feature sets to predict on (note: a “feature set” is the set of feature values we want to predict on; a simple way of thinking about it is that a feature set is just one input we want a prediction for). Batch predictions are essential, as they allow client services to send and retrieve 1 to N predictions at once, greatly reducing the number of calls needed (see the request sketch after this list).
- Shadow predictions: In addition to making predictions and sending them back to client services, the option to make shadow predictions asynchronously in the background was essential. Often, before committing to one particular model, teams have multiple candidate models and want to test them all at once on the same data. Letting teams use one model for official predictions while asynchronously making predictions on the same data with different models gives them the flexibility and power to analyze the efficacy of the various candidates.
- Feature and model fetching: As mentioned above, Sibyl needs to be able to fetch both features and models from their respective stores. Features are fetched during each prediction; models, on the other hand, are fetched once when the service starts up and cached in memory, saving the time and compute of loading them on every request (a sketch of this caching follows this list).
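To make the request shape concrete, here is a minimal sketch of what a batch prediction request with shadow models might look like. All names and types are illustrative assumptions, not Sibyl’s actual schema; in the real service these would be the generated gRPC/protobuf message types, and feature values are not necessarily all numeric.

```kotlin
// Hypothetical request/response shapes, shown as Kotlin data classes for readability.
// A "feature set" is one input we want a prediction for: a map of feature name -> value.
data class FeatureSet(
    val id: String,                      // client-supplied id so predictions can be matched back
    val features: Map<String, Double>    // feature name -> feature value (numeric for simplicity)
)

data class PredictionRequest(
    val modelId: String,                 // model used for the official predictions
    val shadowModelIds: List<String>,    // candidate models scored asynchronously on the same data
    val featureSets: List<FeatureSet>    // 1..N feature sets predicted on in a single batch call
)

data class PredictionResponse(
    val predictions: Map<String, Double> // feature set id -> predicted value
)
```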
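Similarly, here is a rough sketch of the startup-time model caching described above, with hypothetical interfaces standing in for the real model store and model formats: everything is loaded once into an in-memory map, so per-request lookups never touch the model store.

```kotlin
import java.util.concurrent.ConcurrentHashMap

// Hypothetical model abstractions -- the real model formats and store API are not shown here.
interface Model {
    fun predict(features: Map<String, Double>): Double
}

data class ModelConfig(
    val requiredFeatures: List<String>,      // features this model needs
    val defaultValues: Map<String, Double>,  // fallback values for missing features
    val modelType: String                    // e.g. a tree-based or neural network model
)

interface ModelStore {
    fun listModelIds(): List<String>
    fun loadModel(id: String): Model
    fun loadConfig(id: String): ModelConfig
}

// Models and configs are loaded once at startup and kept in memory,
// so each request only does a cheap map lookup instead of hitting the model store.
class ModelCache(private val modelStore: ModelStore) {
    private val cache = ConcurrentHashMap<String, Pair<Model, ModelConfig>>()

    fun warmUp() {
        modelStore.listModelIds().forEach { id ->
            cache[id] = modelStore.loadModel(id) to modelStore.loadConfig(id)
        }
    }

    fun get(modelId: String): Pair<Model, ModelConfig>? = cache[modelId]
}
```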
Implementation: General service overview and decision highlights
To get a general understanding of how the service works, as well as a brief overview of its moving parts, here is what the lifecycle of a typical request looks like (referencing Figure 2):

- The request arrives.
- For each model identified in the request, we grab both the model and its model config (which contains info on that model, such as its required features, default fallback values for those features, and the model type) from an in-memory cache.
- Then we iterate through the feature values provided in the request and determine which required feature values are missing. We do this for all models and all feature sets at once, storing the values in a map for easy lookup.
- For all missing features, we attempt to retrieve values from the feature store, a Redis cache of feature values. If a value still cannot be found, we fall back to the default value specified in the model config (see the feature-resolution sketch after this list).
- Now that we have every feature value required for the predictions, we asynchronously make a prediction for each feature set. For each shadow model, we also launch an asynchronous coroutine, but don’t wait for its results before continuing (see the coroutine sketch after this list).
- Finally, with all predictions made, we construct a response protobuf object, populate it with the predictions, and return it to the client.
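The missing-feature step can be sketched as a simple priority chain, reusing the hypothetical ModelConfig from the earlier sketch: take the value from the request if provided, otherwise try the Redis-backed feature store, and otherwise fall back to the default in the model config. The FeatureStore interface here is an assumption for illustration.

```kotlin
// Hypothetical feature store client; in Sibyl the backing store is a Redis cache of feature values.
interface FeatureStore {
    // Returns the cached value for a feature, or null if it is not in the cache.
    fun fetch(featureName: String): Double?
}

// Resolve every feature a model needs, in priority order:
// 1. value provided in the request, 2. value from the feature store, 3. default from the model config.
fun resolveFeatures(
    provided: Map<String, Double>,
    config: ModelConfig,
    featureStore: FeatureStore
): Map<String, Double> =
    config.requiredFeatures.associateWith { name ->
        provided[name]
            ?: featureStore.fetch(name)
            ?: config.defaultValues.getValue(name)
    }
```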
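And here is a minimal coroutine sketch of the last three steps, assuming the types from the sketches above: official predictions are made concurrently and awaited, shadow predictions are launched in a separate long-lived scope and never awaited (so they add no latency to the response), and the response is populated with the official results. This is illustrative Kotlin, not Sibyl’s actual code.

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.launch

suspend fun handleRequest(
    request: PredictionRequest,
    modelCache: ModelCache,
    featureStore: FeatureStore,
    backgroundScope: CoroutineScope   // long-lived scope for fire-and-forget shadow work
): PredictionResponse = coroutineScope {
    val (model, config) = requireNotNull(modelCache.get(request.modelId)) { "unknown model" }

    // One coroutine per feature set for the official model; these are awaited before responding.
    val official = request.featureSets.map { featureSet ->
        async {
            featureSet.id to model.predict(resolveFeatures(featureSet.features, config, featureStore))
        }
    }

    // Shadow models are scored on the same data in the background and never awaited;
    // their outputs would be logged (e.g. to Snowflake) for offline evaluation.
    request.shadowModelIds.forEach { shadowId ->
        modelCache.get(shadowId)?.let { (shadowModel, shadowConfig) ->
            backgroundScope.launch {
                request.featureSets.forEach { featureSet ->
                    shadowModel.predict(resolveFeatures(featureSet.features, shadowConfig, featureStore))
                }
            }
        }
    }

    // Construct the response and populate it with the official predictions.
    PredictionResponse(predictions = official.awaitAll().toMap())
}
```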