Design Techniques for Serving Machine Learning Models in Cloud-Native Applications

With the advances in machine learning, model serving is quickly becoming one of the hottest topics. The two fundamental approaches to model serving are embedding, where models are deployed directly into the application, and model serving as a service, where a separate service dedicated to model serving can be used from any application in the enterprise. In this blog post, we will look at different implementations of model serving as a service and show how to use one of the most popular model servers available: TensorFlow Serving.

Why Model Servers?

According to this blog post:

Model servers simplify the task of deploying machine learning at scale, the same way app servers simplify the task of delivering a web app or API to end users. The rise of model servers, will likely accelerate the adoption of user-facing machine learning in the wild.

A popular pattern for real-time data processing (including model serving) uses streaming. When streaming is combined with a model server, the stream processing application invokes the model server remotely for every record that needs to be scored.

When considering this type of architecture, it is necessary to keep in mind its advantages and disadvantages:

  • Advantages
    • Simplifies integration with other technologies and organizational processes
    • Easier to understand if you come from a non-streaming world, i.e., where you are accustomed to working with “as a service” deployments
  • Disadvantages
    • More applications and running services are required in the overall implementation, requiring additional management, monitoring, etc.
    • Usage of remote calls for serving creates additional latency and unpredictability for the overall processing time
    • Tight temporal coupling to the model server, which can impact the overall SLA and throughput

Despite these disadvantages, model servers remain one of the most popular approaches to putting model serving into production.

Model Server Implementation

When it comes to the model server implementation, there are two popular approaches:

  • Model as code
  • Model as data

One of the most popular model-as-code implementations is Clipper from RISELab.

From http://clipper.ai/tutorials/basic_concepts/

Clipper considers trained models as functions that take some input and produce some output, which makes Clipper a function server. These functions can be implemented in any way and the use of Docker images makes it easy to include all of a model’s dependencies in a self-contained environment.
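
The "function server" idea can be illustrated with a minimal sketch. The names below (Model, FunctionServer, linearModel, thresholdModel) are hypothetical and are not part of Clipper's API; the sketch only shows that once models are plain functions, the server does not need to know how each one is implemented.

```scala
// Hypothetical sketch: a "model as code" server treats every model as a function.
// None of these names come from Clipper's API.

// A model is simply a function from input features to a prediction.
type Model = Seq[Double] => Double

// Two stand-in "models" implemented in completely different ways.
val linearModel: Model =
  features => features.zip(Seq(0.5, 0.25)).map { case (x, w) => x * w }.sum
val thresholdModel: Model =
  features => if (features.sum > 1.0) 1.0 else 0.0

// The "function server": a registry that dispatches a request to a named function.
final class FunctionServer(models: Map[String, Model]) {
  def predict(name: String, features: Seq[Double]): Either[String, Double] =
    models.get(name).map(model => model(features)).toRight(s"unknown model: $name")
}

val server = new FunctionServer(
  Map("linear" -> linearModel, "threshold" -> thresholdModel)
)
```

In a real deployment each function would run in its own Docker container, as the quote above describes; the in-process map here only stands in for that dispatch step.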

Another example of a general-purpose model serving tool is AWS SageMaker.

From https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html

SageMaker allows you to run inference on both provided and custom models (packaged as Docker images).

In addition to general-purpose model servers, there are also model servers dedicated to specific frameworks.

Finally, Seldon-core provides Kubernetes-native deployment of a model serving graph, which can include:

  • Model - a microservice that returns predictions, for example a TensorFlow, sklearn, or any other model packaged as a Docker image
  • Router - a microservice that routes requests to one of its children, for example for A/B tests or multi-armed bandits
  • Combiner - a microservice that combines responses from its children into a single response
  • Transformer - a microservice that transforms an input, for example for feature normalization, outlier detection, or concept drift
From https://docs.seldon.io/projects/seldon-core/en/latest/workflow/README.html

Seldon-core is implemented as a Kubernetes operator, which lets you specify the desired graph (and its images) in the form of a YAML file.
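
To make the four node types more concrete, here is a minimal in-process sketch of how such a graph could be evaluated. All type and function names (Node, ModelNode, RouterNode, CombinerNode, TransformerNode, evaluate) are hypothetical and are not Seldon-core's API; in Seldon-core each node is a separate microservice rather than a function.

```scala
// Hypothetical types for illustration only; this is not Seldon-core's API.
sealed trait Node
// Model: returns a prediction for the given features.
final case class ModelNode(predict: Seq[Double] => Double) extends Node
// Router: picks exactly one child (by index) to handle the request.
final case class RouterNode(choose: Seq[Double] => Int, children: Seq[Node]) extends Node
// Combiner: merges the responses of all of its children into one response.
final case class CombinerNode(combine: Seq[Double] => Double, children: Seq[Node]) extends Node
// Transformer: rewrites the input before passing it to its child.
final case class TransformerNode(transform: Seq[Double] => Seq[Double], child: Node) extends Node

// Recursively evaluate a request against the graph.
def evaluate(node: Node, input: Seq[Double]): Double = node match {
  case ModelNode(predict)                => predict(input)
  case RouterNode(choose, children)      => evaluate(children(choose(input)), input)
  case CombinerNode(combine, children)   => combine(children.map(child => evaluate(child, input)))
  case TransformerNode(transform, child) => evaluate(child, transform(input))
}

// Example graph: normalize the input, then average the outputs of two models.
val graph: Node = TransformerNode(
  features => features.map(_ / 10.0),
  CombinerNode(
    outputs => outputs.sum / outputs.size,
    Seq(
      ModelNode(features => features.sum),
      ModelNode(features => features.head * 2)
    )
  )
)
```

Calling evaluate(graph, Seq(10.0, 30.0)) first normalizes the input to (1.0, 3.0), scores it with both models, and averages the two results.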

Model as data takes a different approach. Instead of leveraging model code for scoring, it exports a model as data, which allows you to use different runtimes for machine learning and model serving and thus gives greater flexibility to model serving implementations. Examples of intermediate formats include Predictive Model Markup Language (PMML), Portable Format for Analytics (PFA), and Open Neural Network Exchange (ONNX). In addition to these standards, TensorFlow uses its own intermediate format, based on binary Protocol Buffers.

A model serving implementation based on the TensorFlow export format is TensorFlow Serving.

From: https://www.tensorflow.org/tfx/serving/architecture

TensorFlow Serving uses exported TensorFlow models and supports running predictions on them over REST (HTTP) or gRPC. TensorFlow Serving can be configured to use either:

  • A single (latest) version of the model
  • Multiple, specific versions of the model
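
The difference between the two options can be sketched as a small selection function. The types below (VersionPolicy, Latest, Specific) and the function versionsToServe are hypothetical illustrations, not TensorFlow Serving's configuration API; they only model the decision of which of the exported versions end up being served.

```scala
// Hypothetical sketch of the two version policies described above;
// these types are illustrative and are not TensorFlow Serving's configuration API.
sealed trait VersionPolicy
// Serve only the highest-numbered version that is available.
case object Latest extends VersionPolicy
// Serve an explicit set of versions (those that actually exist).
final case class Specific(versions: Set[Long]) extends VersionPolicy

// Given the versions found in the model directory, decide which ones to serve.
def versionsToServe(available: Set[Long], policy: VersionPolicy): Set[Long] =
  policy match {
    case Latest             => if (available.isEmpty) Set.empty else Set(available.max)
    case Specific(versions) => available.intersect(versions)
  }
```

With versions 1, 2, and 3 exported, the Latest policy selects only version 3, while Specific(Set(1L, 3L)) keeps versions 1 and 3 available at the same time, which is useful for comparing an old and a new model side by side.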

In the remainder of the post we will show how to use TensorFlow Serving.

Using TensorFlow Serving

The simplest way to start using TensorFlow Serving is with one of the provided Docker images. Once the image is pulled and an exported model is available, you can start a server using the following command (values in angle brackets are placeholders; port 8500 serves gRPC and port 8501 serves the REST API):

docker run -p 8500:8500 -p 8501:8501 \
--mount type=bind,source=<path to exported model>,target=/models/<model name> \
-e MODEL_NAME=<model name> -t tensorflow/serving &

Once the server is running, you can validate that it works correctly by querying it about the deployed model with the following curl command:

curl http://localhost:8501/v1/models/<model name>/versions/<model version>
{
  "model_version_status": [
    {
      "version": "<model version>",
      "state": "AVAILABLE",
      "status": {
        "error_code": "OK",
        "error_message": ""
      }
    }
  ]
}

You can also request a prediction by running the following command:

curl -X POST http://localhost:8501/v1/models/<model name>/versions/<model version>:predict -d '<request JSON>'
{
    "predictions": [
        {  }
    ]
}

To understand the format of both the request and the prediction, you can always run this command:

curl http://localhost:8501/v1/models/<model name>/versions/<model version>/metadata

It returns the complete definition of the model's inputs and outputs.
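
The three REST endpoints used above follow one URL pattern, which the following sketch captures. ModelEndpoint and predictBody are hypothetical helpers, and the host, port, and model name are assumptions. The request body uses the "instances" row format of the TensorFlow Serving REST API, assuming a model that takes a flat vector of numbers; the actual shape depends on the model's signature, which the metadata call reports.

```scala
// Hypothetical helper for building TensorFlow Serving REST URLs.
// Host, port, and model name are assumptions supplied by the caller.
final case class ModelEndpoint(host: String, port: Int, model: String, version: Option[Int]) {
  // When no version is given, the server uses the version it considers current.
  private val base: String = {
    val versionPart = version.map(v => s"/versions/$v").getOrElse("")
    s"http://$host:$port/v1/models/$model$versionPart"
  }
  def statusUrl: String   = base               // GET: model version status
  def metadataUrl: String = s"$base/metadata"  // GET: input/output definition
  def predictUrl: String  = s"$base:predict"   // POST: run a prediction
}

// Build a row-format request body: {"instances": [[...], [...]]}.
// Assumes each instance is a flat vector of numbers.
def predictBody(instances: Seq[Seq[Double]]): String =
  instances
    .map(row => row.mkString("[", ", ", "]"))
    .mkString("{\"instances\": [", ", ", "]}")
```

For example, ModelEndpoint("localhost", 8501, "wine", Some(1)).predictUrl yields the URL used in the Akka HTTP example that follows.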

Programmatic access to TensorFlow Serving can be done as follows, using this example taken from Lightbend’s free tutorial on model serving:

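import akka.http.scaladsl.Http
import akka.http.scaladsl.model.{ContentTypes, HttpEntity, HttpMethods, HttpRequest, HttpResponse}
import akka.http.scaladsl.unmarshalling.Unmarshal
import scala.concurrent.Future
import scala.util.{Failure, Success}

// Assumes an implicit ActorSystem, Materializer and ExecutionContext are in scope,
// and that gson, request and Prediction are defined elsewhere in the tutorial.
val responseFuture: Future[HttpResponse] = Http().singleRequest(HttpRequest(
  method = HttpMethods.POST,
  uri = "http://localhost:8501/v1/models/wine/versions/1:predict",
  entity = HttpEntity(ContentTypes.`application/json`, gson.toJson(request))
))
// Get Result
responseFuture.onComplete {
  case Success(res) =>
    // Read the response body as a string and extract the first prediction
    Unmarshal(res.entity).to[String].map(pString => {
      val prediction = gson.fromJson(pString, classOf[Prediction]).predictions(0).toSeq
      ….
    })
  case Failure(_) => sys.error("something wrong")
    ….
}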

In this example, we use Akka HTTP to send HTTP requests to the server and to process the response. The example uses version 1 of the wine model. It also uses Gson to convert case classes to and from JSON.

Summary

As the popularity of machine learning for solving real-life problems grows, so does the popularity of model servers, which simplify and accelerate these deployments. In addition to model serving itself, they often provide additional features, including caching, monitoring, load balancing, testing, security, and more.


Learn More About Machine Learning

Last year, I wrote an O'Reilly eBook titled Serving Machine Learning Models: A Guide to Architecture, Stream Processing Engines, and Frameworks, which I encourage you to download for additional learning. I hope you enjoy it!
