How To Deploy Kubeflow On Lightbend Platform With OpenShift - Part 6: ML Parameter Tuning

Boris Lublinsky Principal Architect, Lightbend, Inc.

Using Kubeflow For ML Parameter Tuning

In Part 5 of “How To Deploy And Use Kubeflow On OpenShift”, we looked at machine learning in production. In this part, we turn our attention to tuning ML parameters with Kubeflow and Katib.

Kubeflow provides Katib, a scalable and flexible hyperparameter tuning framework that is tightly integrated with Kubernetes and applicable to any deep learning framework, including TensorFlow, MXNet, and PyTorch. The project is inspired by Google Vizier.

As in Google Vizier, Katib is based on three main concepts:

Study - a single optimization run over a feasible space. Each Study contains a configuration describing the feasible space, as well as a set of Trials. It is assumed that the objective function f(x) does not change in the course of a Study.
Trial - a list of parameter values, x, that will lead to a single evaluation of f(x). A Trial is “Completed” once it has been evaluated and the objective value f(x) has been assigned to it; until then it is “Pending”. One Trial corresponds to one Kubernetes Job.
Suggestion - an algorithm to construct a parameter set. Currently Katib supports the following exploration algorithms:
- random
- grid
- hyperband
- bayesian optimization
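
To make the three concepts concrete, here is a minimal Python sketch of how a Study, its Trials, and two of the Suggestion algorithms (random and grid) relate to each other. This is not Katib's implementation: the class names, function names, the toy objective, and the parameter values are all hypothetical and chosen only for illustration. In Katib each Trial would run as a Kubernetes Job rather than as a local function call.

```python
import itertools
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Feasible space: parameter name -> candidate values (hypothetical values).
FeasibleSpace = Dict[str, List[float]]


@dataclass
class Trial:
    """One parameter set x and, once evaluated, its objective value f(x)."""
    params: Dict[str, float]
    objective: Optional[float] = None

    @property
    def status(self) -> str:
        # A Trial is Pending until an objective value has been assigned.
        return "Pending" if self.objective is None else "Completed"


def random_suggestion(space: FeasibleSpace, count: int,
                      rng: random.Random) -> List[Trial]:
    """Random exploration: sample each parameter independently."""
    return [Trial({name: rng.choice(values) for name, values in space.items()})
            for _ in range(count)]


def grid_suggestion(space: FeasibleSpace) -> List[Trial]:
    """Grid exploration: one Trial per point of the Cartesian product."""
    names = list(space)
    return [Trial(dict(zip(names, combo)))
            for combo in itertools.product(*(space[n] for n in names))]


@dataclass
class Study:
    """A single optimization run over a fixed feasible space and objective."""
    space: FeasibleSpace
    objective_fn: Callable[[Dict[str, float]], float]
    trials: List[Trial] = field(default_factory=list)

    def run(self, suggested: List[Trial]) -> None:
        # Evaluate each suggested Trial; here this is just a function call.
        for trial in suggested:
            self.trials.append(trial)
            trial.objective = self.objective_fn(trial.params)

    def best(self) -> Trial:
        # Maximize the objective over all Completed Trials.
        completed = [t for t in self.trials if t.status == "Completed"]
        return max(completed, key=lambda t: t.objective)


def toy_objective(params: Dict[str, float]) -> float:
    """Stand-in for a validation metric; peaks at lr=0.02, momentum=0.9."""
    return -(params["lr"] - 0.02) ** 2 - (params["momentum"] - 0.9) ** 2


if __name__ == "__main__":
    space = {"lr": [0.01, 0.02, 0.03], "momentum": [0.8, 0.9]}
    study = Study(space, toy_objective)
    study.run(grid_suggestion(space))
    print(study.best().params)
```

Grid exploration evaluates every combination in the feasible space, while random exploration evaluates only as many Trials as requested, which is why it is often preferred when the space is large.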

Before creating a Study job, it is necessary to create an additional role and role binding. Kubeflow is typically tested using GKE, which is significantly less strict than OpenShift. To run the study job operator successfully, additional RBAC permissions have to be granted to the studyjob-controller service account, under which studyjob-controller runs. To do this, create the following YAML file defining both the role and the role binding, and save it locally:

# Role granting access to StudyJob resources and their finalizers
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: studyjobs-role
  labels:
    app: studyjobs
rules:
- apiGroups: ["kubeflow.org"]
  resources: ["studyjobs", "studyjobs/finalizers"]
  verbs: ["get", "list", "watch", "update", "patch"]
---
# Bind the role to the service account that runs studyjob-controller
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: studyjobs-rolebinding
  labels:
    app: studyjobs
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: studyjobs-role
subjects:
  - kind: ServiceAccount
    name: studyjob-controller

Next, assuming the file was saved as studyjobs-role.yaml, run the following command to apply it in the kubeflow namespace:

$ oc apply -f studyjobs-role.yaml -n kubeflow

Now you can create a Study Job for Katib by defining a StudyJob config file. To create an example job, use the following command:

$ oc create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/random-example.yaml

To make sure that it works, run this command:

$ oc get studyjob -n kubeflow
NAME             AGE
random-example   2m

You can also look at the execution results using the Katib UI. Go to the main Kubeflow UI (see Part 2) and click on the Katib dashboard. You should see a list of executed study jobs.

Clicking on a specific StudyID (for example, g79a10864cd0492b) brings up the study job configuration screen.

From the same UI, you can also look at the study results.

Additionally this UI allows you to manage study job configurations: worker templates and metrics collector templates.

To delete a job, execute:

$ oc delete studyjob random-example -n kubeflow

That’s all for this part. Check out the next post on serving ML models in production, and thanks for reading!

P.S. If you’d like professional guidance on best practices and how-tos for Machine Learning, simply contact us to learn how Lightbend can help.

Boris Lublinsky is a Principal Architect at Lightbend. Boris has over 30 years of experience in enterprise, technical architecture, and software engineering. He is an active member of the OASIS SOA RM committee and co-author of Applied SOA: Service-Oriented Architecture and Design Strategies (Wiley), Professional Hadoop Solutions (Wiley), Serving Machine Learning Models (O’Reilly), and Kubeflow for Machine Learning: From Lab to Production (O’Reilly).

February 28, 2019