In Part 5 of “How To Deploy And Use Kubeflow On OpenShift”, we looked at machine learning in production. In this part, we turn our attention to tuning ML parameters with Kubeflow and Katib.
Kubeflow provides Katib, a scalable and flexible hyperparameter tuning framework that is tightly integrated with Kubernetes and works with any deep learning framework, including TensorFlow, MXNet, and PyTorch. The project is inspired by Google Vizier.
As in Google Vizier, Katib is based on three main concepts: Study, Trial, and Suggestion.
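Roughly speaking, a Study is a single optimization run over a search space, a Trial is one evaluation of a concrete set of parameter values, and a Suggestion is the algorithm that proposes the next set of values to try. The following is a minimal sketch of how these pieces fit together for random search, written in plain Python. It is not Katib's implementation (Katib runs trials as Kubernetes jobs), and the names `suggest_random`, `run_study`, `toy_objective`, and the `lr` parameter range are hypothetical, chosen only for illustration.

```python
import random


def suggest_random(search_space, rng):
    """Suggestion: draw one value per parameter from its feasible range."""
    return {
        name: rng.uniform(low, high)
        for name, (low, high) in search_space.items()
    }


def run_study(objective, search_space, num_trials, seed=0):
    """Study: run num_trials trials and return the best (params, score)."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(num_trials):
        params = suggest_random(search_space, rng)  # Suggestion step
        score = objective(params)                   # Trial: one evaluation
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params):
    """Hypothetical stand-in for a training run; peaks at lr = 0.03."""
    return -(params["lr"] - 0.03) ** 2


# Hypothetical search space: one parameter with a (min, max) range.
space = {"lr": (0.01, 0.05)}
best_params, best_score = run_study(toy_objective, space, num_trials=20)
```

In a real study the objective is a full training job that reports a metric such as validation accuracy, which is why a framework that schedules many trials in parallel on a cluster is useful.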
Before creating a StudyJob, you need to create an additional role and role binding. Kubeflow is typically tested using GKE, which is significantly less strict than OpenShift. For the StudyJob operator to run successfully on OpenShift, additional RBAC permissions have to be granted to the studyjob-controller service account, under which studyjob-controller runs. To do this, create the following YAML file defining both the role and the role binding, and save it locally as studyjobs-role.yaml:
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: studyjobs-role
  labels:
    app: studyjobs
rules:
- apiGroups: ["kubeflow.org"]
  resources: ["studyjobs", "studyjobs/finalizers"]
  verbs: ["get", "list", "watch", "update", "patch"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: studyjobs-rolebinding
  labels:
    app: studyjobs
roleRef:
  kind: Role
  name: studyjobs-role
subjects:
- kind: ServiceAccount
  name: studyjob-controller
Next run the following command to install it:
$ oc apply -f studyjobs-role.yaml -n kubeflow
Now you can create a StudyJob for Katib from a StudyJob config file. The following command creates an example job:
$ oc create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/random-example.yaml
To make sure that it works, run this command:
$ oc get studyjob -n kubeflow
NAME             AGE
random-example   2m
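If you want to script this check rather than read the output by eye, the sketch below shows one possible approach. It assumes the two-column table format shown above and that `oc` is on the PATH and logged in to the cluster; the helper names `parse_studyjob_names` and `list_studyjobs` are hypothetical.

```python
import subprocess


def parse_studyjob_names(table_text):
    """Return the NAME column from the table printed by `oc get studyjob`."""
    lines = [line for line in table_text.splitlines() if line.strip()]
    # The first non-empty line is the header (NAME AGE); the first column
    # of every remaining row is the StudyJob name.
    return [line.split()[0] for line in lines[1:]]


def list_studyjobs(namespace="kubeflow"):
    """Run `oc get studyjob -n <namespace>` and return the StudyJob names."""
    result = subprocess.run(
        ["oc", "get", "studyjob", "-n", namespace],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if oc exits with an error
    )
    return parse_studyjob_names(result.stdout)
```

Calling `"random-example" in list_studyjobs()` against the cluster then tells you whether the example job was created.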
You can also look at the execution results using the Katib UI. Go to the main Kubeflow UI (see post 2) and click on the Katib dashboard. You should see a list of executed study jobs:
Clicking on a specific StudyID (g79a10864cd0492b) brings up the following study job configuration screen:
You can also look at the study results:
Additionally, this UI allows you to manage study job configurations: worker templates and metrics collector templates.
To delete a job, execute:
$ oc delete studyjob random-example -n kubeflow
That’s all for this part. Check out the next post on serving ML models in production, and thanks for reading!
P.S. If you’d like professional guidance on best practices and how-tos with Machine Learning, simply contact us to learn how Lightbend can help.