
How To Deploy Kubeflow On Lightbend Platform With OpenShift - Part 6: ML Parameter Tuning

Boris Lublinsky Principal Architect, Lightbend, Inc.

Using Kubeflow For ML Parameter Tuning 

In Part 5 of “How To Deploy And Use Kubeflow On OpenShift”, we looked at machine learning in production. In this part, we turn our attention to tuning ML parameters with Kubeflow and Katib.

Kubeflow provides Katib, a scalable and flexible hyperparameter tuning framework tightly integrated with Kubernetes and applicable to any Deep Learning framework including TensorFlow, MXNet, and PyTorch. This project is inspired by Google Vizier.

As in Google Vizier, Katib is based on three main concepts:

  • Study - a single optimization run over a feasible space. Each Study contains a configuration describing the feasible space, as well as a set of Trials. It is assumed that the objective function f(x) does not change in the course of a Study.
  • Trial - a list of parameter values, x, that will lead to a single evaluation of f(x). A Trial can be “Completed”, which means that it has been evaluated and the objective value f(x) has been assigned to it, otherwise it is “Pending”. One trial corresponds to one k8s Job.
  • Suggestion - an algorithm to construct a parameter set. Katib supports several exploration algorithms.
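To make the three concepts concrete, here is a minimal Python sketch. It is not the Katib API: the class and function names (Study, Trial, random_suggestion, run_study) and the parameter names in the usage example are hypothetical, and the Suggestion shown is plain uniform random search, chosen only because it is the simplest exploration strategy to illustrate.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types for illustration only; Katib itself runs each
# Trial as a Kubernetes Job rather than as an in-process function call.
Params = Dict[str, float]


@dataclass
class Trial:
    params: Params                     # the parameter values x
    objective: Optional[float] = None  # f(x), assigned once evaluated

    @property
    def status(self) -> str:
        return "Completed" if self.objective is not None else "Pending"


@dataclass
class Study:
    feasible_space: Dict[str, Tuple[float, float]]  # name -> (min, max)
    trials: List[Trial] = field(default_factory=list)


def random_suggestion(study: Study, rng: random.Random) -> Trial:
    """Suggestion: draw every parameter uniformly from its range."""
    params = {
        name: rng.uniform(low, high)
        for name, (low, high) in study.feasible_space.items()
    }
    trial = Trial(params)
    study.trials.append(trial)
    return trial


def run_study(
    study: Study,
    objective: Callable[[Params], float],
    n_trials: int,
    seed: int = 0,
) -> Trial:
    """Evaluate n_trials suggestions and return the best completed Trial."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        trial = random_suggestion(study, rng)
        trial.objective = objective(trial.params)
    completed = [t for t in study.trials if t.status == "Completed"]
    return max(completed, key=lambda t: t.objective)


if __name__ == "__main__":
    # Toy objective with its maximum at lr = 0.03 (an assumed example).
    study = Study({"lr": (0.01, 0.05), "momentum": (0.5, 0.9)})
    best = run_study(study, lambda p: -(p["lr"] - 0.03) ** 2, n_trials=20)
    print(best.status, best.params)
```

The objective is passed in as a fixed function, which mirrors the assumption that f(x) does not change during a Study.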

Before creating a Study job, you need to create an additional role and role binding. Kubeflow is typically tested using GKE, which is significantly less strict than OpenShift. To run the study job operator successfully, additional RBAC permissions have to be given to the studyjob-controller service account, under which the studyjob-controller runs. To do this, create the following YAML file defining both the role and the role binding, and save it locally as studyjobs-role.yaml:

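apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: studyjobs-role
  labels:
    app: studyjobs
rules:
# StudyJob custom resources live in the kubeflow.org API group
- apiGroups: ["kubeflow.org"]
  resources: ["studyjobs", "studyjobs/finalizers"]
  verbs: ["get", "list", "watch", "update", "patch"]
---
# Bind the role to the studyjob-controller service account
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: studyjobs-rolebinding
  labels:
    app: studyjobs
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: studyjobs-role
subjects:
  - kind: ServiceAccount
    name: studyjob-controller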

Next run the following command to install it:

$ oc apply -f studyjobs-role.yaml -n kubeflow
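One way to confirm that the new permissions took effect is to ask the cluster whether the service account may perform each verb. The sketch below is an assumption on my part rather than a step from the original setup: it builds `oc auth can-i` commands that impersonate the studyjob-controller service account through the `--as` flag, and the helper name can_i_command is hypothetical. Impersonation itself requires sufficient privileges, so run it as a cluster admin.

```python
import shutil
import subprocess
from typing import List


def can_i_command(
    verb: str, resource: str, namespace: str, service_account: str
) -> List[str]:
    """Build an `oc auth can-i` call that impersonates a service account."""
    user = f"system:serviceaccount:{namespace}:{service_account}"
    return ["oc", "auth", "can-i", verb, resource, "-n", namespace, "--as", user]


if __name__ == "__main__":
    if shutil.which("oc") is None:
        print("oc not found on PATH; skipping permission check")
    else:
        # The verbs granted by studyjobs-role in the YAML file.
        for verb in ["get", "list", "watch", "update", "patch"]:
            cmd = can_i_command(
                verb, "studyjobs", "kubeflow", "studyjob-controller"
            )
            result = subprocess.run(cmd, capture_output=True, text=True)
            # `can-i` prints "yes" or "no" for each check.
            print(f"{verb}: {result.stdout.strip()}")
```

If any verb reports "no", recheck that the role and role binding were applied to the kubeflow namespace.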

Now you can create a Study Job for Katib by defining a StudyJob config file. For an example job, use the following command:

oc create -f

To make sure that it works, run this command:

$ oc get studyjob -n kubeflow
NAME             AGE
random-example   2m
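If you want to script this check, for example in a smoke test, the tabular output can be parsed with a few lines of Python. This is a minimal sketch that assumes the plain whitespace-separated format shown above, where no column value contains spaces; parse_table and has_job are hypothetical helper names.

```python
from typing import Dict, List


def parse_table(output: str) -> List[Dict[str, str]]:
    """Turn whitespace-separated `oc get` output into one dict per row."""
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def has_job(output: str, name: str) -> bool:
    """Return True if a row with the given NAME is present."""
    return any(row.get("NAME") == name for row in parse_table(output))


# Sample output copied from the command above.
SAMPLE = """NAME             AGE
random-example   2m
"""

if __name__ == "__main__":
    print(parse_table(SAMPLE))
    print(has_job(SAMPLE, "random-example"))
```

For anything beyond a quick check, requesting structured output with `-o json` and parsing that is likely more robust than splitting columns.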

You can also look at the execution results using the Katib-UI. Go to the main kubeflow UI (see post 2) and click on the Katib dashboard. You should see a list of executed study jobs:

Clicking on a specific StudyID (g79a10864cd0492b) brings up the following study job configuration screen:

You can also look at the study results:

Additionally this UI allows you to manage study job configurations: worker templates and metrics collector templates.

To delete a job, execute:

$ oc delete studyjob random-example -n kubeflow

That’s all for this part. Check out the next post on serving ML models in production, and thanks for reading!

p.s. If you’d like to get professional guidance on best-practices and how-tos with Machine Learning, simply contact us to learn how Lightbend can help.

