This is the last post in our cloud-native series that focuses on Machine Learning in streaming data applications. To get a deeper background on general principles of streaming data (“fast data”) architectures, I recommend my free O’Reilly report, Fast Data Architectures for Streaming Applications. When it comes to serving ML models in production, Boris Lublinsky’s free O’Reilly report, Serving Machine Learning Models, explores this topic in greater depth.
In this post, we discuss some additional, real-world concerns you’ll need to factor into your model-serving applications. Specifically, we’ll discuss the impact of extra uncertainty that model serving introduces into your systems, as well as security, privacy, and regulatory issues.
We software developers like things to be deterministic, so they are predictable. We’re forced to accept some loss of this property in distributed systems, because ordering is never guaranteed, but we often go to great lengths to ensure an ordering at some point. This could be for business reasons, e.g., we might need transactions to be recorded in the correct time order. It could also be for technical reasons, e.g., when streaming video, we need to ensure the network packets of video received by a player are properly time ordered, to reconstruct the video.
But data scientists are accustomed to working in probabilities and statistics, where precise determinism is neither possible nor even needed in most cases. Model serving adds a new dimension of uncertainty to systems in production, because when a new model is deployed, the system behavior changes in a probabilistic way (at best, hopefully), not a deterministic way.
Fortunately, data scientists use the same mathematical tools to quantify errors, variances around means, performance metrics, etc. We software developers have to work with data scientists to become comfortable with this new, added non-determinism.
A practical implication of all this is the need to quantify expected behavior with well-defined tolerances. When serving models, we need to gather metrics to ensure that performance meets the requirements. Because data tends to drift over time, models usually grow stale. Performance metrics can also be used to decide when it’s time to train a new model on newer data.
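To make the idea of "well-defined tolerances" concrete, here is a minimal sketch of a monitor that tracks accuracy over a sliding window of recently scored records and flags when it drops below a threshold. The class name, the window size, and the threshold are all hypothetical, and the sketch assumes the true outcome for each scored record eventually becomes available, which is not always the case in practice.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Tracks accuracy over the most recent `window` scored records."""

    def __init__(self, window: int, min_accuracy: float):
        # A bounded deque drops the oldest outcome once the window is full.
        self.outcomes = deque(maxlen=window)
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual) -> None:
        """Store whether one prediction matched the observed outcome."""
        self.outcomes.append(predicted == actual)

    def accuracy(self) -> float:
        """Fraction of correct predictions in the window (1.0 if empty)."""
        if not self.outcomes:
            return 1.0
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self) -> bool:
        """Flag a stale model, but only once the window holds enough data."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.min_accuracy


# Hypothetical usage: two of the last four predictions were wrong.
monitor = RollingAccuracyMonitor(window=4, min_accuracy=0.75)
for predicted, actual in [(1, 1), (0, 0), (1, 0), (0, 1)]:
    monitor.record(predicted, actual)
print(monitor.accuracy(), monitor.needs_retraining())
```

Accuracy is only one possible metric. The same structure would work for precision, recall, or a statistic that compares the distribution of incoming data with the training data.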
Machine learning is used in lots of applications where bad models can cause degraded behavior with serious consequences. Data scientists will of course do their best to quantify the risks, but the organization also has to secure the entire environment, from data ingestion and model development to production model training and serving. Model tampering will certainly be a growing problem as models become more pervasive in our world.
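One simple guard against model tampering can be sketched as follows, under the assumption that a SHA-256 digest of the serialized model is recorded in a trusted location when training finishes, and that the serving system checks it before loading the model. The function names and the example bytes are hypothetical, and a real deployment would more likely rely on signed artifacts than on a bare checksum.

```python
import hashlib
import hmac


def model_digest(model_bytes: bytes) -> str:
    """Return the SHA-256 hex digest of a serialized model."""
    return hashlib.sha256(model_bytes).hexdigest()


def verify_model(model_bytes: bytes, expected_digest: str) -> bool:
    """Check a model against the digest recorded at training time."""
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(model_digest(model_bytes), expected_digest)


# Hypothetical usage: the digest is recorded when training completes.
trained_model = b"serialized model parameters"
recorded_digest = model_digest(trained_model)

print(verify_model(trained_model, recorded_digest))         # unmodified model
print(verify_model(trained_model + b"!", recorded_digest))  # altered model
```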
Hence, the data science environment will require a level of security, especially if sensitive data is available there for model development. Furthermore, any model that is served in a production environment should be trained in a production environment as well, with equally-stringent security controls. This also makes reproducibility and traceability easier to manage. It can be tempting to let the data science team “tinker” with a model, then hand it to a developer for deployment, but that informality will quickly become unacceptable, just as it’s now considered unacceptable to manually deploy hand-built applications to production! CI/CD processes are essential for normal software delivery to production; they are essential for deploying data science models, too!
In practice, a lot of model training will need to be done at production scale anyway, due to the volume of data used and the resources required for intensive algorithms like hyperparameter tuning and final training of neural networks.
Almost all interesting data is sensitive data in one way or another. If it were completely public and unprotected, there would be no competitive advantage to the holder, as everyone could use it. But perhaps more importantly, people are justifiably concerned that sensitive data about them is managed with effective controls.
A problem with some naive applications of machine learning is that the models effectively “remember” some of the data they were trained on. Hence, even if the privacy of the data is maintained, some sensitive information can be acquired from the models trained with it, if and when those models are used outside the production environment, e.g., downloaded to phones for local use by applications.
Besides the obvious data protection techniques that can be used, a number of techniques are emerging to increase privacy protections while still enabling model training and scoring to work well. Two exciting techniques are Differential Privacy and Federated Learning. Differential privacy basically asks, “If I run a query over a dataset, then remove a record from it and rerun the query, will the difference in the query results reveal sensitive information from the removed record?” For example [1], suppose I have a database table with two columns, one with personal identifiers (e.g., the social security number, or SSN, in the United States) and the other column with ones and zeros, where a one means the person has cancer, while a zero means the person doesn’t have cancer.
In this case, if I sum the second column, remove a record, then sum the column again, I can tell if the person has cancer! If the result changes, I know the person associated with the removed record had a “1” in the column, meaning they have cancer. If the sum is unchanged, the person doesn’t have cancer. This example clearly fails any reasonable standard of differential privacy, but more sophisticated datasets and restricted queries can provide high levels of privacy.
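The sum-and-remove example can be sketched in a few lines. The table contents and identifiers below are made up. The second half of the sketch adds Laplace noise to the sum, which is one standard way to make such a query differentially private; the post itself doesn't prescribe a mechanism, so treat that part, including the choice of epsilon, as my assumption.

```python
import random

# Hypothetical table: identifier -> 1 (has cancer) or 0 (does not).
records = {"id-001": 1, "id-002": 0, "id-003": 1, "id-004": 0}


def exact_sum(table):
    """The unprotected query: sum the sensitive column."""
    return sum(table.values())


def differencing_attack(table, target_id):
    """Recover one person's value from two exact sums."""
    without_target = {k: v for k, v in table.items() if k != target_id}
    return exact_sum(table) - exact_sum(without_target)


def noisy_sum(table, epsilon, rng):
    """Sum plus Laplace noise of scale 1/epsilon.

    Removing one record changes a 0/1 sum by at most 1, so the
    sensitivity of the query is 1.
    """
    # The difference of two independent exponential draws with rate
    # epsilon follows a Laplace distribution with scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return exact_sum(table) + noise


# With exact sums, the attack reads the sensitive value directly.
print(differencing_attack(records, "id-001"))
print(differencing_attack(records, "id-002"))

# With noisy sums, a single pair of queries no longer reveals it reliably.
rng = random.Random(0)
smaller = {k: v for k, v in records.items() if k != "id-001"}
print(noisy_sum(records, 0.5, rng) - noisy_sum(smaller, 0.5, rng))
```

A smaller epsilon means more noise and stronger privacy, at the cost of less accurate answers. Repeating the query many times and averaging would wash out the noise, which is why real systems also limit how many queries can be run.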
In federated learning, a model is trained, then shared with other machines, often edge devices like mobile phones, where they improve the training using their local, sensitive data. The updated models are uploaded to the central data center and “averaged” in some way to improve the overall performance of the model for everyone. The local private data is not uploaded. This cycle is repeated on a regular basis with a subset of all devices to continually improve the model for everyone.
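Here is a toy sketch of that cycle, assuming the simplest possible "averaging": each device's updated weights are weighted by how many local examples it holds. The model is a one-variable linear regression so the whole thing fits in plain Python; the function names, learning rate, and device data are all hypothetical, and production systems add device sampling, secure aggregation, and much more.

```python
def local_update(weights, data, learning_rate=0.05, steps=20):
    """Improve the shared model y = w*x + b using one device's private data."""
    w, b = weights
    for _ in range(steps):
        # Gradients of the mean squared error over the local examples.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return [w, b]


def federated_average(updates):
    """Average weight vectors, weighted by each device's example count."""
    total = sum(count for _, count in updates)
    size = len(updates[0][0])
    return [
        sum(weights[i] * count for weights, count in updates) / total
        for i in range(size)
    ]


def training_round(global_weights, devices):
    """One cycle: send the model out, train locally, average the results."""
    # Only the updated weights leave each device, never the data itself.
    updates = [
        (local_update(global_weights, data), len(data)) for data in devices
    ]
    return federated_average(updates)


# Hypothetical private data on two devices, all drawn from y = 2x + 1.
devices = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
]

weights = [0.0, 0.0]
for _ in range(100):
    weights = training_round(weights, devices)
print(weights)  # should approach w = 2, b = 1
```

Note that the central server only ever calls `federated_average` on weights. The lists in `devices` stand in for data that would stay on the phones.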
The combination of differential privacy and federated learning allows distributed systems to train their models without sharing sensitive data from the edge devices to the data center or inadvertently leaking sensitive information. It turns out, when you use assistive typing or fingerprint/face recognition on your phone, this is exactly how these features are trained!
I believe that we’ll see widespread growth of these tools for improving the overall privacy protection of data, while still providing effective applications of machine learning, which is why I took several paragraphs to explain them and hopefully pique your interest.
Suppose you work for a bank and you’ve trained a neural network to approve or reject credit card applications. One day, someone who is a member of a minority group claims the bank discriminated against her by turning down her application. Was it a fair decision or was the system biased against this minority group? Model explainability is a major challenge for complex models, like neural networks, where it can be difficult to understand how differences in input data affect the results.
This is obviously an important, complicated topic, but for our purposes, we’ll focus on a few implications. In order to investigate the decision that was made, you’ll need the data that was used to compute the recommendation, any quality metrics output with the score, the version of the model that was used for this decision, that model’s parameters and hyperparameters, and possibly all the data used to train the model!
In other words, you’ll need traceability and auditing of everything from training to scoring. Only then can you hope to prove that a fair decision was made or that the system was in fact biased, because the training data was biased or the model was biased in some way.
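As a rough illustration of what such an audit trail might capture per decision, here is a sketch of a record written alongside every score. All field names and example values are hypothetical. The training data is referenced by an identifier rather than copied, on the assumption that training datasets are versioned and stored elsewhere.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ScoringAuditRecord:
    """Everything needed to revisit one scoring decision later."""

    request_id: str
    scored_at: str            # ISO-8601 timestamp of the decision
    model_version: str
    hyperparameters: dict
    training_data_ref: str    # pointer to the versioned training dataset
    input_record: dict        # the data used to compute the score
    score: float
    quality_metrics: dict     # any metrics emitted with the score

    def to_json(self) -> str:
        """Serialize with sorted keys so the output is stable."""
        return json.dumps(asdict(self), sort_keys=True)

    def fingerprint(self) -> str:
        """Digest of the record, useful for detecting later modification."""
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()


# Hypothetical usage for a credit card application decision.
record = ScoringAuditRecord(
    request_id="req-0001",
    scored_at="2019-06-01T12:00:00+00:00",
    model_version="credit-approval-1.4.2",
    hyperparameters={"layers": 3, "learning_rate": 0.001},
    training_data_ref="training-set-2019-05",
    input_record={"income": 52000, "requested_limit": 5000},
    score=0.37,
    quality_metrics={"confidence": 0.81},
)
print(record.to_json())
print(record.fingerprint())
```

Appending one such JSON line per decision to a write-once log gives an investigator the model version, its hyperparameters, and the exact input for any disputed result.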
We’ve only scratched the surface of production concerns when deploying machine learning. A few general themes emerged in our discussion. We need careful control over all data, both to protect privacy and to track what data was used to train particular models. This privacy starts in the data center environment. All models deployed in production need to be trained using automated, controlled processes. We need to use techniques that don’t leak sensitive information through the models, which might need to be shared outside our environment in order to be used. Models themselves need to be secured against tampering. We need the ability to explain scoring results. We need to audit enough information for regulatory compliance and to improve our services for better customer satisfaction.
All of these requirements are in high demand from Lightbend’s customers, which is why we created Lightbend Pipelines to help smooth the modernization challenges of using streaming data applications with Machine Learning.
[1] Taken from this excellent Udacity course, https://www.udacity.com/course/secure-and-private-ai--ud185.