Managing Models Using ModelDB

In Part 8 of “How To Deploy And Use Kubeflow On OpenShift”, we looked at deployment operations using Kubeflow Pipelines. In this final part of the series, we look at the last Kubeflow component we will describe: ModelDB, which provides model management.

Many organizations build hundreds of models a day, but it is very hard to manage all the models that are built over time. ModelDB is an end-to-end system that tracks models as they are built, extracts and stores relevant metadata (e.g., hyperparameters, data sources) for models, and makes this data available for easy querying and visualization. ModelDB organizes model data in a three-level hierarchy, from bottom to top:

  • Experiment run: every execution of a script/program creates an experiment run.
  • Experiment: related experiment runs can be grouped into an Experiment (for example, “running hyperparameter optimization for the Neural Network”).
  • Project: Finally, all experiments belong to a Project (for example, “recommender”).
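The hierarchy above can be pictured as plain nested records. The following sketch is purely illustrative (the class names are ours, not the ModelDB API), showing how runs roll up into experiments and experiments into projects:

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative sketch of ModelDB's three-level hierarchy; these
# dataclasses are hypothetical and not part of the ModelDB client.

@dataclass
class ExperimentRun:
    script: str                                  # the script/program that was executed
    hyperparameters: Dict[str, float] = field(default_factory=dict)

@dataclass
class Experiment:
    name: str                                    # e.g. a hyperparameter sweep
    runs: List[ExperimentRun] = field(default_factory=list)

@dataclass
class Project:
    name: str                                    # e.g. "recommender"
    experiments: List[Experiment] = field(default_factory=list)

# One run of a neural-network sweep inside the "recommender" project
run = ExperimentRun("train.py", {"learning_rate": 0.01})
experiment = Experiment("NN hyperparameter optimization", [run])
project = Project("recommender", [experiment])
```

Every execution of a script adds one more `ExperimentRun` at the bottom of this tree, which is what makes the accumulated metadata easy to query later.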

The main use cases for ModelDB include:

  • Tracking Modeling Experiments
  • Versioning Models
  • Ensuring Reproducibility
  • Visual exploration of models and results
  • Collaboration

ModelDB is not part of the “standard” Kubeflow install. It can be installed using the following command:

$ ks generate modeldb modeldb
$ ks apply default -c modeldb

Once installed, ModelDB can be populated (by writing directly to the ModelDB database) in one of the following ways:

  • Light API (Python): a generic way for users to integrate any ML workflow with ModelDB.
  • scikit-learn client (Python): a library for logging scikit-learn models to ModelDB.
  • spark.ml client (Scala): a library for logging Spark ML models to ModelDB.

Some usage examples of the ModelDB APIs can be found in the ModelDB repository. Below is a simple notebook (leveraging the Light API) that populates ModelDB:

# Install ModelDB
!pip install modeldb --upgrade
from modeldb.basic.Structs import (
    Model, ModelConfig, ModelMetrics, Dataset)
from modeldb.basic.ModelDbSyncerBase import Syncer
# Create a syncer using a convenience API
syncer_obj = Syncer.create_syncer("Project Name",
    "test_user",
    "project description",
    host="modeldb-backend")
# Create Datasets by specifying their filepaths and optional metadata,
# associate a tag (key) with each Dataset (value), and sync them
datasets = {
    "train" : Dataset("/path/to/train", {"num_cols" : 15, "dist" : "random"}),
    "test" : Dataset("/path/to/test", {"num_cols" : 15, "dist" : "gaussian"})
}
syncer_obj.sync_datasets(datasets)
# Create the Model, ModelConfig, and ModelMetrics instances and sync them
model = "model_obj"
model_type = "NN"
mdb_model1 = Model(model_type, model, "/path/to/model1")
model_config1 = ModelConfig(model_type, {"l1" : 10})
model_metrics1 = ModelMetrics({"accuracy" : 0.8})
syncer_obj.sync_model("train", model_config1, mdb_model1)
syncer_obj.sync_metrics("test", mdb_model1, model_metrics1)
# Actually write everything to ModelDB
syncer_obj.sync()
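Note that the metrics dictionary passed to `ModelMetrics` is computed by the user, not by ModelDB. For example, an accuracy value like the `0.8` synced above could come from a simple comparison of predictions against labels (a generic sketch, not part of the ModelDB API):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# 4 of 5 predictions correct -> 0.8, the kind of value
# one would pass to ModelMetrics({"accuracy": ...})
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]
acc = accuracy(y_true, y_pred)
```

Any scalar metric computed this way (accuracy, RMSE, AUC, and so on) can be recorded in the same dictionary and queried later through the ModelDB frontend.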

Once this program runs, we can expose the modeldb-frontend service as a route and see what was created. The ModelDB frontend shows the list of projects, along with the information about the project that we just created.

I hope you enjoyed this series on setting up and deploying machine learning on Kubeflow and OpenShift. If you’d like to get professional guidance on best-practices and how-tos with Machine Learning, simply contact us to learn how Lightbend can help.
