With the advances in machine learning, model serving is quickly becoming one of the hottest topics. There are two fundamental approaches to it: embedding, where models are deployed directly into the application, and model serving as a service, where a separate service dedicated to model serving can be used from any application in the enterprise. In this blog post we will look at different implementations of model serving as a service and show how to use one of the most popular model servers available: TensorFlow Serving.
According to this blog post:
Model servers simplify the task of deploying machine learning at scale, the same way app servers simplify the task of delivering a web app or API to end users. The rise of model servers will likely accelerate the adoption of user-facing machine learning in the wild.
A popular pattern for real time data processing (including model serving) uses streaming. When combining streaming with model serving, the overall architecture looks like this:
When considering this type of architecture it is necessary to keep in mind its advantages and disadvantages:
Despite some disadvantages, model servers are still one of the most popular approaches to putting model serving into production.
When it comes to the model server implementation, there are two popular approaches: model as code and model as data.
One of the most popular model-as-code implementations is Clipper from RISE Lab.
Clipper considers trained models as functions that take some input and produce some output, which makes Clipper a function server. These functions can be implemented in any way and the use of Docker images makes it easy to include all of a model’s dependencies in a self-contained environment.
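The "model as a function" idea can be illustrated with a small Scala sketch. It is not Clipper's API: `FunctionServer`, `ModelFunction`, and the toy `wineScore` model are hypothetical names introduced purely for illustration, and a real function server would also isolate each function in its own container.

```scala
// Hypothetical sketch of a "function server"; this is NOT Clipper's actual API.
object FunctionServerSketch {
  // A model is just a function from a feature vector to a score.
  type ModelFunction = Seq[Double] => Double

  final class FunctionServer {
    private var models = Map.empty[String, ModelFunction]

    // Register (or replace) a model under a name.
    def deploy(name: String, model: ModelFunction): Unit =
      models += (name -> model)

    // Score the input with the named model; None if no such model is deployed.
    def predict(name: String, input: Seq[Double]): Option[Double] =
      models.get(name).map(model => model(input))
  }

  // Toy stand-in for a trained model: a fixed linear combination of the features.
  val wineScore: ModelFunction = features =>
    features.zip(Seq(0.5, 0.25)).map { case (x, w) => x * w }.sum

  def main(args: Array[String]): Unit = {
    val server = new FunctionServer
    server.deploy("wine", wineScore)
    println(server.predict("wine", Seq(2.0, 4.0)))  // Some(2.0)
    println(server.predict("missing", Seq(1.0)))    // None
  }
}
```

Because the server only sees a function signature, how the function is implemented (framework, language, dependencies) stays hidden behind it.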
Another example of a general-purpose model serving tool is AWS SageMaker.
SageMaker allows you to run inference on both provided and custom models (packaged as Docker images).
In addition to general-purpose model servers, there are implementations of model servers dedicated to specific frameworks, including:
Finally, Seldon-core provides Kubernetes-native deployment for model serving graphs, including:
Seldon-core is implemented as a Kubernetes operator, which lets you specify the desired graph (and its images) in the form of a YAML file.
Model as data takes a different approach. Instead of using model code for scoring, it exports the model as data. This lets you use different runtimes for machine learning and for model serving, which gives model serving implementations greater flexibility. Examples of intermediate formats include Predictive Model Markup Language (PMML), Portable Format for Analytics (PFA), and ONNX. In addition to these standards, TensorFlow uses its own intermediate format, based on binary Protocol Buffers.
A model serving implementation based on the TensorFlow export format is TensorFlow Serving.
TensorFlow Serving uses exported TensorFlow models and supports running predictions on them using REST or gRPC. TensorFlow Serving can be configured to use either:
In the remainder of the post we will show how to use TensorFlow Serving.
The simplest way to start using TensorFlow Serving is with one of the provided Docker images. Once the image is loaded and an exported model is created, you can start a server using the following command:
docker run -p 8500:8500 -p 8501:8501 \
  --mount type=bind,source=<model location>,target=/models/<model name> \
  -e MODEL_NAME=<model name> -t tensorflow/serving &
Once the container is running, you can validate that the server works correctly by querying it about the deployed model. TensorFlow Serving exposes gRPC on port 8500 and its REST API on port 8501, so the following curl command targets port 8501:
curl http://localhost:8501/v1/models/<model name>/versions/<version>
{
  "model_version_status": [
    {
      "version": "<version>",
      "state": "AVAILABLE",
      "status": {
        "error_code": "OK",
        "error_message": ""
      }
    }
  ]
}
You can also request a prediction by running the following command:
curl -X POST http://localhost:8501/v1/models/<model name>/versions/<version>:predict -d '<request JSON>'
{
"predictions": [
{ }
]
}
To understand the format of both the request and the prediction, you can always run this command:
curl http://localhost:8501/v1/models/<model name>/versions/<version>/metadata
It returns the complete definition of model input and output.
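The three REST calls above share one URL layout, which a few Scala helpers can make explicit. This is only a sketch: `TfServingRest` and its method names are hypothetical, and the body builder assumes the REST API's row format (`"instances"`) with purely numeric inputs. The actual shape of each instance depends on the model's signature, which the metadata call describes.

```scala
// Hypothetical helpers; only the URL layout and the "instances" field
// come from the TensorFlow Serving REST API.
object TfServingRest {
  // Base URL of one model version, e.g. http://localhost:8501/v1/models/wine/versions/1
  def versionUrl(host: String, port: Int, model: String, version: Int): String =
    s"http://$host:$port/v1/models/$model/versions/$version"

  // GET: status of the deployed model version.
  def statusUrl(host: String, port: Int, model: String, version: Int): String =
    versionUrl(host, port, model, version)

  // POST: run a prediction.
  def predictUrl(host: String, port: Int, model: String, version: Int): String =
    versionUrl(host, port, model, version) + ":predict"

  // GET: definition of the model's inputs and outputs.
  def metadataUrl(host: String, port: Int, model: String, version: Int): String =
    versionUrl(host, port, model, version) + "/metadata"

  // Row-format request body: {"instances": [[...], [...]]}
  // Assumes every instance is a flat vector of numbers.
  def predictBody(rows: Seq[Seq[Double]]): String =
    rows.map(_.mkString("[", ", ", "]")).mkString("{\"instances\": [", ", ", "]}")

  def main(args: Array[String]): Unit = {
    println(predictUrl("localhost", 8501, "wine", 1))
    println(metadataUrl("localhost", 8501, "wine", 1))
    println(predictBody(Seq(Seq(1.0, 2.0), Seq(3.0, 4.0))))
  }
}
```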
You can also access TensorFlow Serving programmatically. The following example is taken from Lightbend’s free tutorial on model serving:
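import akka.http.scaladsl.Http
import akka.http.scaladsl.model._
import akka.http.scaladsl.unmarshalling.Unmarshal
import scala.concurrent.Future
import scala.util.{Failure, Success}

// Assumes an implicit ActorSystem and ExecutionContext are in scope, and that
// `gson`, `request`, and the `Prediction` class are defined elsewhere in the tutorial.
val responseFuture: Future[HttpResponse] = Http().singleRequest(HttpRequest(
  method = HttpMethods.POST,
  uri = "http://localhost:8501/v1/models/wine/versions/1:predict",
  entity = HttpEntity(ContentTypes.`application/json`, gson.toJson(request))
))
// Get Result
responseFuture
  .onComplete {
    case Success(res) =>
      // Read the response body as a string and parse the first prediction from it
      Unmarshal(res.entity).to[String].map(pString => {
        val prediction = gson.fromJson(pString, classOf[Prediction]).predictions(0).toSeq
        // ….
      })
    case Failure(_) => sys.error("something wrong")
    // ….
  }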
In this example we use Akka HTTP to send HTTP requests to the server and to process the response. The example uses version 1 of the wine model. It also uses GSON to convert case classes to and from JSON.
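If pulling in Akka HTTP and GSON is not an option, the same call can be sketched with only the JDK's built-in `java.net.http` client (Java 11 or later). This is a swapped-in, blocking alternative rather than the tutorial's code. `PredictClientSketch` is a hypothetical name, and the request body below is made-up sample input, since the real shape depends on the wine model's signature.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Hypothetical, dependency-free alternative to the Akka HTTP example.
object PredictClientSketch {
  // Build a POST request that carries a JSON body.
  def buildRequest(url: String, jsonBody: String): HttpRequest =
    HttpRequest.newBuilder(URI.create(url))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
      .build()

  // Send the request synchronously and return (status code, raw JSON response).
  def send(request: HttpRequest): (Int, String) = {
    val client = HttpClient.newHttpClient()
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    (response.statusCode(), response.body())
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample input; check the metadata endpoint for the real input shape.
    val body = """{"instances": [[7.4, 0.7, 0.0]]}"""
    val request = buildRequest("http://localhost:8501/v1/models/wine/versions/1:predict", body)
    // Requires a running TensorFlow Serving instance with the wine model deployed.
    val (status, json) = send(request)
    println(s"HTTP $status: $json")
  }
}
```

Unlike the Akka HTTP version, `send` blocks the calling thread until the response arrives, which keeps the sketch short but is less suitable for streaming applications.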
As the popularity of machine learning for solving real-life problems grows, so does the popularity of model servers, which simplify and accelerate these deployments. In addition to model serving itself, they often provide extra features, including caching, monitoring, load balancing, testing, security, and more.
Last year, I wrote an O'Reilly eBook titled Serving Machine Learning Models: A Guide to Architecture, Stream Processing Engines, and Frameworks, which I encourage you to download for additional learning. I hope you enjoy it!