Responding to Prediction Requests in Milliseconds

Ben Weber

As an applied data scientist at Zynga, I’ve started getting hands-on with building and deploying data products. As I’ve explored more use cases for machine learning, I’ve seen an increasing need for real-time machine learning (ML) systems, which perform feature engineering and model inference to respond to prediction requests within milliseconds. While I’ve previously used tools such as AWS SageMaker for near-real-time model inference, I only recently explored options for also doing feature engineering on the fly.

Ad technology is one of the domains where real-time ML is a requirement for a system to perform well in the advertising marketplace. On the demand side, a real-time bidder implementing the OpenRTB specification needs to predict which ad impressions are most likely to drive conversion events. On the supply side, an ad mediation platform needs to determine the bid floor for advertising inventory in real time in order to optimize advertising revenue.

In a real-time ML deployment, the system replies to a request within milliseconds of the request being made. There are two general workflows for making prediction requests with a real-time system:

  1. Web Requests
  2. Streaming Workflows

In the first case, the system or client that needs a prediction makes an HTTP request to an endpoint, which responds directly with a prediction. Other protocols, such as gRPC, can also be used for this type of workflow.
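To make the web request workflow concrete, here is a minimal sketch of a prediction endpoint in Go using only the standard library. The payload fields (`sessions`, `purchases`), the `/predict`-style handler, and the logistic-regression weights are all hypothetical placeholders for illustration, not the model or API from a real system. The `main` function uses `httptest` to spin up a throwaway local server so the sketch runs end to end.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"math"
	"net/http"
	"net/http/httptest"
	"strings"
)

// PredictionRequest is a hypothetical payload; the field names are
// illustrative assumptions.
type PredictionRequest struct {
	Sessions  float64 `json:"sessions"`
	Purchases float64 `json:"purchases"`
}

// PredictionResponse carries the model score back to the caller.
type PredictionResponse struct {
	Score float64 `json:"score"`
}

// score applies a toy logistic regression. The weights are made up;
// a real system would load them from a trained model.
func score(req PredictionRequest) float64 {
	z := -1.0 + 0.1*req.Sessions + 0.5*req.Purchases
	return 1.0 / (1.0 + math.Exp(-z))
}

// predictHandler decodes the JSON request, scores it, and replies
// directly with the prediction.
func predictHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "POST required", http.StatusMethodNotAllowed)
		return
	}
	var req PredictionRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, "invalid JSON body", http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(PredictionResponse{Score: score(req)}); err != nil {
		log.Printf("encode error: %v", err)
	}
}

func main() {
	// A throwaway test server keeps the sketch self-contained; a real
	// deployment would register the handler and call http.ListenAndServe.
	srv := httptest.NewServer(http.HandlerFunc(predictHandler))
	defer srv.Close()

	body := strings.NewReader(`{"sessions": 10, "purchases": 2}`)
	resp, err := http.Post(srv.URL, "application/json", body)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var out PredictionResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("score: %.3f\n", out.Score)
}
```

The handler keeps everything in memory, so the response time is dominated by JSON parsing and a handful of floating-point operations. In practice the feature lookup, not the model math, tends to be where the latency budget goes.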

The second workflow can be implemented in a variety of ways. For example, a request can be published to a Kafka topic, where it is processed with Spark Streaming, and the result is published to a separate topic. Other streaming frameworks, such as Flink or GCP Dataflow, can also be used to respond to prediction requests in near real time.
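The shape of the streaming workflow is a consume, score, publish loop. The sketch below illustrates that shape with in-memory Go channels standing in for the request and result topics. It is not a Kafka or Spark client, and the `Request` and `Prediction` types, the `runWorkers` helper, and the placeholder `scoreValue` model are all hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Request and Prediction are hypothetical message types; in a real
// deployment they would be serialized records on streaming topics.
type Request struct {
	ID    string
	Value float64
}

type Prediction struct {
	ID    string
	Score float64
}

// scoreValue is a placeholder model: a linear function clamped to [0, 1].
func scoreValue(v float64) float64 {
	s := 0.2 * v
	if s < 0 {
		return 0
	}
	if s > 1 {
		return 1
	}
	return s
}

// runWorkers starts n goroutines that consume requests from in, score
// them, and publish results to out. It closes out once in is drained.
func runWorkers(n int, in <-chan Request, out chan<- Prediction) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for req := range in {
				out <- Prediction{ID: req.ID, Score: scoreValue(req.Value)}
			}
		}()
	}
	go func() {
		wg.Wait()
		close(out)
	}()
}

func main() {
	requests := make(chan Request)       // stand-in for the request topic
	predictions := make(chan Prediction) // stand-in for the result topic
	runWorkers(4, requests, predictions)

	// Producer: publish a few requests, then signal completion.
	go func() {
		defer close(requests)
		for i := 0; i < 5; i++ {
			requests <- Request{ID: fmt.Sprintf("req-%d", i), Value: float64(i)}
		}
	}()

	// Consumer: read results as they arrive. Order is not guaranteed
	// because the workers run concurrently.
	for p := range predictions {
		fmt.Printf("%s -> %.2f\n", p.ID, p.Score)
	}
}
```

Each prediction carries the ID of the request that produced it, because in a streaming setup the caller reads results from a separate topic and has to match them back to the original requests.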

Over the past year, I’ve gotten hands-on with Golang to build real-time ML systems. While Python can be used to implement these types of systems, Golang is typically able to handle more requests per second on a fixed number of machines. Additionally, Golang has some elegant features for working with NoSQL data stores when building real-time ML systems. For use cases with extremely large request volumes, I’ve targeted pure Go implementations for efficiency.

In this post, I’ll discuss some of the options available for building ML systems that…
