Machine Learning in Kubeflow (Based On TensorFlow)

In Part 4 of “How To Deploy And Use Kubeflow On OpenShift”, we looked at setting up and using JupyterHub with Kubeflow. In this post, we look closer at running Machine Learning jobs with TensorFlow.

While JupyterHub is a great tool for initial experimentation with data and for prototyping ML jobs, putting these jobs into production requires a TensorFlow job (TFJob). A TFJob is a Kubernetes custom resource for running TensorFlow training jobs on Kubernetes, and it has a simple YAML representation.
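
Because a TFJob is an ordinary Kubernetes resource, its manifest can be written as plain data. The sketch below builds a minimal manifest as a Python dictionary and prints it as JSON, which oc apply accepts as well as YAML. It assumes the v1beta1 API with a tfReplicaSpecs map; the job name example-tfjob, the image example.com/my-training-image:latest, and the replica counts are hypothetical placeholders, not values from this series.

```python
import json


def replica_spec(replicas, image):
    """Build one entry of tfReplicaSpecs for the given replica count and image."""
    return {
        "replicas": replicas,
        "restartPolicy": "OnFailure",
        "template": {
            "spec": {
                # tf-operator expects the training container to be named "tensorflow".
                "containers": [{"name": "tensorflow", "image": image}]
            }
        },
    }


def build_tfjob(name, image, workers=2, ps=1, namespace="kubeflow"):
    """Return a TFJob manifest as a dict (field layout assumed for v1beta1)."""
    return {
        "apiVersion": "kubeflow.org/v1beta1",
        "kind": "TFJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "tfReplicaSpecs": {
                "PS": replica_spec(ps, image),
                "Worker": replica_spec(workers, image),
            }
        },
    }


# Hypothetical name and image, for illustration only.
manifest = build_tfjob("example-tfjob", "example.com/my-training-image:latest")
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and submitted with oc apply -f, although the rest of this post uses the ksonnet prototype instead.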

The Kubeflow implementation of TFJob is in tf-operator. Kubeflow ships with a ksonnet prototype suitable for running the TensorFlow CNN Benchmarks. You can also use this prototype to generate a component which you can then customize for your jobs.

A distributed TensorFlow job typically contains some of the following processes:

  • Chief - responsible for orchestrating training and performing tasks like checkpointing the model.
  • Ps - parameter servers; these servers provide a distributed data store for the model parameters.
  • Worker - does the actual work of training the model. In some cases, worker 0 might also act as the chief.
  • Evaluator - can be used to compute evaluation metrics as the model is trained.
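
Each process usually learns its role from the TF_CONFIG environment variable, a JSON document describing the cluster and the current task, which tf-operator sets in the job's pods. A minimal sketch of reading it in a training script follows. The function name describe_task is hypothetical, and treating worker 0 as the chief when no chief is listed simply mirrors the note in the list above.

```python
import json
import os


def describe_task(env=None):
    """Return (task_type, task_index, is_chief) parsed from TF_CONFIG."""
    env = os.environ if env is None else env
    config = json.loads(env.get("TF_CONFIG", "{}"))
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    task_type = task.get("type", "worker")
    task_index = int(task.get("index", 0))
    if "chief" in cluster:
        # A dedicated chief exists, so only that process orchestrates training.
        is_chief = task_type == "chief"
    else:
        # No dedicated chief: worker 0 takes on the chief's duties.
        is_chief = task_type == "worker" and task_index == 0
    return task_type, task_index, is_chief
```

A training script could call describe_task() at startup and, for example, write checkpoints only when is_chief is true.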

Before running the TFJob example discussed next, an additional role and role binding have to be added. Kubeflow is typically tested on GKE, which is significantly less strict than OpenShift. For the TFJob operator to run successfully, additional RBAC permissions have to be granted to the tf-job-operator service account, under which the operator runs. To do this, create the following YAML file defining both the role and the role binding, and save it locally as tfjobs-role.yaml:

# Role granting access to TFJob resources and their finalizers
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: tfjobs-role
  labels:
    app: tfjobs
rules:
- apiGroups: ["kubeflow.org"]
  resources: ["tfjobs", "tfjobs/finalizers"]
  verbs: ["get", "list", "watch", "update", "patch"]
---
# Binds the role above to the service account used by the TFJob operator
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: tfjobs-rolebinding
  labels:
    app: tfjobs
roleRef:
  kind: Role
  name: tfjobs-role
subjects:
  - kind: ServiceAccount
    name: tf-job-operator

Now run the following command to install it:

$ oc apply -f tfjobs-role.yaml -n kubeflow

To run the TFJob example, make sure that you are in the ksonnet application directory (kubeflow/openshift/ks_app if you followed the posts so far) and run the following commands:

$ CNN_JOB_NAME=testcnnjob
$ VERSION=master
$ ks registry add kubeflow-git github.com/kubeflow/kubeflow/tree/${VERSION}/kubeflow
$ ks pkg install kubeflow-git/examples
$ ks generate tf-job-simple ${CNN_JOB_NAME} --name=${CNN_JOB_NAME}

This generates a testcnnjob.jsonnet file that can be used to run the job. Unfortunately, the generated file uses the older API version, v1alpha2:

local tfjob = {
  apiVersion: "kubeflow.org/v1alpha2",
  kind: "TFJob",
  metadata: {
    name: name,
    namespace: namespace,
  },
  // ... remainder of the generated spec omitted ...
};

You can see the correct version by running this command:

$ oc describe crd tfjobs.kubeflow.org

It returns the following information about the currently installed TFJobs CRD:

Name:         tfjobs.kubeflow.org
Namespace:    
……..
  Version:                      v1beta1
……..

The CRD description reports the current version as v1beta1, so edit the apiVersion string in the generated testcnnjob.jsonnet file to use v1beta1 instead. Then submit the job using the following command:

$ ks apply default -c ${CNN_JOB_NAME}
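
The version edit that has to happen before ks apply can also be scripted rather than done by hand. This is a minimal sketch, not part of the ksonnet tooling: the helper names bump_api_version and patch_component are hypothetical, and the old and new version strings are the ones discussed above.

```python
from pathlib import Path


def bump_api_version(text, old="kubeflow.org/v1alpha2", new="kubeflow.org/v1beta1"):
    """Return text with the old TFJob apiVersion string replaced by the new one."""
    if old not in text:
        # Fail loudly instead of silently leaving the file unchanged.
        raise ValueError(f"{old!r} not found; the file may already be patched")
    return text.replace(old, new)


def patch_component(path):
    """Rewrite the generated component file in place."""
    component = Path(path)
    component.write_text(bump_api_version(component.read_text()))
```

For example, patch_component("components/testcnnjob.jsonnet") could be run from the application directory, assuming ksonnet placed the generated file under components/.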

In the ks apply command, “default” is the name of the ksonnet environment. To list the available environments, run the following command (the output shows the values for my test cluster):

$ ks env list
NAME    OVERRIDE KUBERNETES-VERSION NAMESPACE SERVER
====    ======== ================== ========= ======
default          v1.10.0            kubeflow  https://streampipe.lightbend.com:443

We can now validate that the job is running with this command:

$ oc get -n kubeflow tfjobs ${CNN_JOB_NAME}
NAME         AGE
testcnnjob   9m
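
The AGE column alone does not say whether the job is still running or has finished. Below is a minimal sketch of reading the job's state programmatically. It assumes the TFJob reports a status.conditions list with entries such as Created, Running, Succeeded, or Failed; the exact status layout depends on the tf-operator version, so check the output of oc get with -o json on your cluster first. The helper names are hypothetical.

```python
import json
import subprocess


def current_condition(tfjob):
    """Return the type of the most recent condition whose status is "True"."""
    conditions = tfjob.get("status", {}).get("conditions") or []
    active = [c["type"] for c in conditions if c.get("status") == "True"]
    return active[-1] if active else "Unknown"


def fetch_tfjob(name, namespace="kubeflow"):
    """Fetch the TFJob as a dict via the oc CLI (requires a logged-in session)."""
    result = subprocess.run(
        ["oc", "get", "-n", namespace, "tfjobs", name, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

With a logged-in oc session, current_condition(fetch_tfjob("testcnnjob")) would return the job's latest reported condition under these assumptions.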

To see job-execution details, go to the Kubeflow UI (see post 2) and click on the TFJob dashboard, where the job and its pods are listed.

To see the actual execution results, click on the testcnnjob-worker-0 log; a window pops up with the results.

To delete a job, run the following command:

$ ks delete default -c ${CNN_JOB_NAME}

Additionally, Kubeflow supports training with MPI, MXNet, and PyTorch.

That’s all for this part. Check out the next post on ML Parameter Tuning and thanks for reading!

p.s. If you’d like to get professional guidance on best-practices and how-tos with Machine Learning, simply contact us to learn how Lightbend can help.
