In Part 4 of “How To Deploy And Use Kubeflow On OpenShift”, we looked at setting up and using JupyterHub with Kubeflow. In this post, we take a closer look at running Machine Learning jobs with TensorFlow.
While JupyterHub is a great tool for initial experimentation with the data and for prototyping ML jobs, to put these jobs into production we need a TensorFlow job (TFJob). A TFJob is a Kubernetes custom resource that you can use to run TensorFlow training jobs on Kubernetes; a simple YAML representation is sketched below.
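The following is a minimal sketch of such a resource, assuming the v1beta1 API that we verify later in this post; the name, namespace, image, command, and replica counts are placeholders you would replace with your own values:
apiVersion: kubeflow.org/v1beta1
kind: TFJob
metadata:
  name: my-training-job
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    PS:                              # parameter server(s) holding the model parameters
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow         # tf-operator looks for a container with this name
            image: my-registry/my-model:latest
            command: ["python", "/opt/model/train.py"]
    Worker:                          # workers doing the actual training
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: my-registry/my-model:latest
            command: ["python", "/opt/model/train.py"]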
The Kubeflow implementation of TFJob is in tf-operator. Kubeflow ships with a ksonnet prototype suitable for running the TensorFlow CNN Benchmarks. You can also use this prototype to generate a component which you can then customize for your jobs.
A distributed TensorFlow job typically contains some of the following processes: a Chief, which orchestrates training and handles tasks such as checkpointing the model; one or more Parameter Servers (PS), which provide a distributed store for the model parameters; Workers, which do the actual training; and an Evaluator, which computes evaluation metrics as training progresses.
Before running the TFJob example discussed next, it is necessary to add an additional role and role binding. Kubeflow is typically tested on GKE, which is significantly less strict than OpenShift. To be able to run the TFJob operator successfully, additional RBAC permissions have to be granted to the tf-job-operator service account, under which the operator runs. To do this, create the following YAML file defining both the role and the role binding, and save it locally as tfjobs-role.yaml:
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: tfjobs-role
  labels:
    app: tfjobs
rules:
- apiGroups: ["kubeflow.org"]
  resources: ["tfjobs", "tfjobs/finalizers"]
  verbs: ["get", "list", "watch", "update", "patch"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: tfjobs-rolebinding
  labels:
    app: tfjobs
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: tfjobs-role
subjects:
- kind: ServiceAccount
  name: tf-job-operator
Now run the following command to install it:
$ oc apply -f tfjobs-role.yaml -n kubeflow
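As a quick (optional) sanity check, you can confirm that both objects were created in the kubeflow namespace by listing them via the app=tfjobs label set above:
$ oc get role,rolebinding -l app=tfjobs -n kubeflow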
To run the TFJob example, make sure that your current directory is the ksonnet application (kubeflow/openshift/ks_app if you have followed the posts so far) and run the following commands:
$ CNN_JOB_NAME=testcnnjob
$ VERSION=master
$ ks registry add kubeflow-git github.com/kubeflow/kubeflow/tree/${VERSION}/kubeflow
$ ks pkg install kubeflow-git/examples
$ ks generate tf-job-simple ${CNN_JOB_NAME} --name=${CNN_JOB_NAME}
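Optionally, before editing anything, you can check what ksonnet generated and inspect the component's parameters:
$ ks component list
$ ks param list ${CNN_JOB_NAME}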
The ks generate command above creates a testcnnjob.jsonnet file (under the application's components/ directory) that can be used to run the job. Unfortunately, the generated file uses the older API version, v1alpha2:
local tfjob = {
  apiVersion: "kubeflow.org/v1alpha2",
  kind: "TFJob",
  metadata: {
    name: name,
    namespace: namespace,
  },
You can see the correct version by running this command:
$ oc describe crd tfjobs.kubeflow.org
It returns the following information about the currently installed TFJobs CRD:
Name: tfjobs.kubeflow.org
Namespace:
……..
Version: v1beta1
……..
It specifies the current version as v1beta1. So, edit the apiVersion string in the generated testcnnjob.jsonnet file to use v1beta1 instead.
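One way to make this change, assuming you are still in the ksonnet application directory and the component lives under components/ as described above, is a one-line sed (GNU sed syntax shown):
$ sed -i 's|kubeflow.org/v1alpha2|kubeflow.org/v1beta1|' components/testcnnjob.jsonnet
With the version corrected, we can submit the job using the following command: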
$ ks apply default -c ${CNN_JOB_NAME}
Here “default” is the name of the environment, which can be determined by running the following command (shown here with the values for my test cluster):
$ ks env list
NAME     OVERRIDE   KUBERNETES-VERSION   NAMESPACE   SERVER
====     ========   ==================   =========   ======
default             v1.10.0              kubeflow    https://streampipe.lightbend.com:443
We can verify that the job was created by listing TFJob resources in the kubeflow namespace:
$ oc get -n kubeflow tfjobs ${CNN_JOB_NAME}
NAME         AGE
testcnnjob   9m
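For more detail from the command line, describing the resource shows its replica statuses and recent events:
$ oc describe -n kubeflow tfjobs ${CNN_JOB_NAME}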
To see job-execution details, go to the Kubeflow UI (see post 2) and click on the TFJob dashboard, which lists the job and its current state. To see the actual execution output, click on the testcnnjob-worker-0 logs link; this pops up a window with the execution results.
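If you prefer the command line, the same output can be tailed directly from the worker pod; the pod name below assumes the operator's usual <job-name>-worker-<index> naming:
$ oc logs -n kubeflow -f testcnnjob-worker-0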
To delete a job, run the following command:
$ ks delete default -c ${CNN_JOB_NAME}
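Alternatively, you can delete the TFJob custom resource directly and let Kubernetes garbage-collect the pods the operator created for it:
$ oc delete -n kubeflow tfjobs ${CNN_JOB_NAME}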
Additionally, Kubeflow allows us to do training using MPI, MXNet, and PyTorch.
That’s all for this part. Check out the next post on ML Parameter Tuning and thanks for reading!
p.s. If you’d like to get professional guidance on best-practices and how-tos with Machine Learning, simply contact us to learn how Lightbend can help.