In Part 5 of “How To Deploy And Use Kubeflow On OpenShift”, we looked at machine learning in production. In this part, we turn our attention to tuning ML parameters with Kubeflow and Katib.
Kubeflow provides Katib, a scalable and flexible hyperparameter tuning framework tightly integrated with Kubernetes and applicable to any Deep Learning framework including TensorFlow, MXNet, and PyTorch. This project is inspired by Google Vizier.
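As in Google Vizier, Katib is based on three main concepts: Study (a single optimization run over a feasible parameter space), Trial (one set of parameter values that yields a single evaluation of the objective), and Suggestion (the algorithm that proposes the parameter sets to try).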
Before creating a study job, you need to create an additional role and role binding. Kubeflow is typically tested using GKE, which is significantly less strict than OpenShift. For the study job operator to run successfully, the studyjob-controller service account, under which studyjob-controller runs, needs additional RBAC permissions. To grant them, create the following YAML file defining both the role and the role binding, and save it locally as studyjobs-role.yaml:
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: studyjobs-role
  labels:
    app: studyjobs
rules:
- apiGroups: ["kubeflow.org"]
  resources: ["studyjobs", "studyjobs/finalizers"]
  verbs: ["get", "list", "watch", "update", "patch"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: studyjobs-rolebinding
  labels:
    app: studyjobs
roleRef:
  kind: Role
  name: studyjobs-role
subjects:
- kind: ServiceAccount
  name: studyjob-controller
Next run the following command to install it:
$ oc apply -f studyjobs-role.yaml -n kubeflow
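To confirm that the binding took effect, one option is to ask the API server whether the service account now holds each verb listed in the role. The following is a minimal sketch rather than part of the original setup: the function names `sa_user` and `check_studyjob_rbac` are hypothetical helpers, and it assumes that your own login is allowed to impersonate service accounts, which the `--as` flag requires.

```shell
# Hypothetical helper: build the RBAC user name of a service account.
sa_user() {
  printf 'system:serviceaccount:%s:%s' "$1" "$2"
}

# Hypothetical helper: check every verb granted by studyjobs-role.
# Usage: check_studyjob_rbac <namespace> <service-account>
check_studyjob_rbac() {
  ns="$1"
  sa="$2"
  for verb in get list watch update patch; do
    # "oc auth can-i" exits with a non-zero status when the answer is "no".
    if ! oc auth can-i "$verb" studyjobs.kubeflow.org \
        -n "$ns" --as="$(sa_user "$ns" "$sa")" >/dev/null 2>&1; then
      echo "missing permission: $verb"
      return 1
    fi
  done
  echo "all permissions granted"
}
```

Calling `check_studyjob_rbac kubeflow studyjob-controller` after the `oc apply` step should report that all permissions are granted; if it names a missing verb, re-check the role and the namespace you applied it to.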
Now you can create a study job for Katib. A study job is defined by a StudyJob config file; the following command creates one from an example in the Katib repository:
$ oc create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/random-example.yaml
To make sure that it works, run this command:
$ oc get studyjob -n kubeflow
NAME             AGE
random-example   2m
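The listing above only shows that the resource exists. To follow a study through to the end, you could poll its status. The sketch below rests on an assumption you should verify for your Katib version with `oc get studyjob random-example -n kubeflow -o yaml`: that the StudyJob reports its state in a `.status.condition` field with values such as `Running`, `Completed`, and `Failed`. The function name `wait_for_studyjob` and its attempt and pause parameters are hypothetical.

```shell
# Hypothetical helper: poll a StudyJob until it finishes or attempts run out.
# Usage: wait_for_studyjob <name> <namespace> [max-attempts] [pause-seconds]
wait_for_studyjob() {
  name="$1"
  ns="$2"
  attempts="${3:-60}"
  pause="${4:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Assumption: the current state is exposed as .status.condition.
    state="$(oc get studyjob "$name" -n "$ns" -o jsonpath='{.status.condition}')"
    case "$state" in
      Completed) echo "Completed"; return 0 ;;
      Failed)    echo "Failed";    return 1 ;;
    esac
    i=$((i + 1))
    sleep "$pause"
  done
  echo "Timed out"
  return 2
}
```

For example, `wait_for_studyjob random-example kubeflow 30 20` would check every 20 seconds for up to 10 minutes. Returning distinct exit codes for failure and timeout keeps the helper usable from other scripts.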
You can also look at the execution results in the Katib UI. Go to the main Kubeflow UI (see Part 2) and click on the Katib dashboard. You should see a list of executed study jobs:
Clicking on a specific StudyID (g79a10864cd0492b) brings up the following study job configuration screen:
You can also look at the study results:
Additionally, this UI lets you manage study job configurations: worker templates and metrics collector templates.
To delete a job, execute:
$ oc delete studyjob random-example -n kubeflow
That’s all for this part. Check out the next post on serving ML models in production, and thanks for reading!
P.S. If you’d like professional guidance on best practices and how-tos for machine learning, contact us to learn how Lightbend can help.