Abstract: TensorFlow is currently the most popular deep learning framework among data scientists, and distributed training, which can significantly speed up training, is its killer feature. Yet actually deploying and running large-scale distributed model training poses a new challenge.
Introduction
This series introduces how to run Kubeflow on Alibaba Cloud Container Service. This article shows how to use TFJob to run distributed model training.
Distributed TensorFlow training and Kubernetes
In practice, a user of distributed TensorFlow has to take care of three things:
1. Finding enough resources to run the training. A distributed training job typically needs a number of workers (computation servers) and ps (parameter servers), all of which consume compute resources.
2. Installing and configuring the software and applications that support the computation.
3. Configuring a ClusterSpec, as required by the design of distributed TensorFlow. This JSON-formatted ClusterSpec describes the topology of the whole distributed training cluster. For example, with two workers and two ps, the ClusterSpec looks like the snippet below; every member of the distributed training job must use it to initialize a tf.train.ClusterSpec object and establish in-cluster communication:
cluster = tf.train.ClusterSpec({"worker": [":2222",
                                           ":2222"],
                                "ps": [":2223",
                                       ":2223"]})

The first of these is exactly what Kubernetes resource scheduling excels at: both CPU and GPU scheduling work out of the box. The second is what Docker excels at: fixed, repeatable operations captured in a container image. Building the ClusterSpec automatically is the problem TFJob solves, letting users construct the distributed TensorFlow cluster topology from a simple, centralized configuration.
It is fair to say that the distributed training problem that has long troubled data scientists can be solved quite well by the Kubernetes + TFJob combination.
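To make the structure concrete, the argument passed to tf.train.ClusterSpec is just a JSON-serializable dictionary mapping role names to lists of addresses. The sketch below illustrates this with hypothetical hostnames (worker0, ps0, etc. are placeholders, not names from the example above):

```python
import json

# Hypothetical hosts; in a real job these are resolvable addresses of each member.
cluster_def = {
    "worker": ["worker0:2222", "worker1:2222"],
    "ps": ["ps0:2223", "ps1:2223"],
}

# Every member must build the identical spec, so it is typically serialized
# once and distributed, e.g. via an environment variable.
serialized = json.dumps(cluster_def)

# Each task can look up its own address from its role and index:
print(json.loads(serialized)["worker"][1])  # → worker1:2222
```

Because the spec is plain data, distributing it to every member is just a matter of string transport, which is exactly the gap TFJob fills.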
Deploying distributed training with Kubernetes and TFJob
Adapting the distributed TensorFlow training code
The definition of TFJob was already covered in the earlier article "A first try of TFJob on Alibaba Cloud", so it is not repeated here. Recall that the role types in a TFJob are MASTER, WORKER, and PS.
As a concrete example, suppose the TFJob running distributed training is called distributed-mnist, with 1 MASTER, 2 WORKERs, and 2 PS nodes. The corresponding ClusterSpec has the following format:
{
    "master": [
        "distributed-mnist-master-0:2222"
    ],
    "ps": [
        "distributed-mnist-ps-0:2222",
        "distributed-mnist-ps-1:2222"
    ],
    "worker": [
        "distributed-mnist-worker-0:2222",
        "distributed-mnist-worker-1:2222"
    ]
}

The job of tf_operator is to create the corresponding 5 Pods and pass the environment variable TF_CONFIG into each of them. TF_CONFIG consists of three parts: the current cluster's ClusterSpec, the node's role type, and its id. For example, the Pod for worker 0 receives the following TF_CONFIG:
{
   "cluster": {
      "master": [
         "distributed-mnist-master-0:2222"
      ],
      "ps": [
         "distributed-mnist-ps-0:2222",
         "distributed-mnist-ps-1:2222"
      ],
      "worker": [
         "distributed-mnist-worker-0:2222",
         "distributed-mnist-worker-1:2222"
      ]
   },
   "task": {
      "type": "worker",
      "index": 0
   },
   "environment": "cloud"
}

Here tf_operator takes care of discovering and configuring the cluster topology, sparing the user this chore. All the user's code needs to do is obtain this context from the TF_CONFIG environment variable.
This means the user has to modify the distributed training code according to the TFJob convention:
import os
import json

import tensorflow as tf

# Read the JSON-formatted data from the environment variable TF_CONFIG
tf_config_json = os.environ.get("TF_CONFIG", "{}")
# Deserialize it into a Python object
tf_config = json.loads(tf_config_json)
# Get the cluster spec
cluster_spec = tf_config.get("cluster", {})
cluster_spec_object = tf.train.ClusterSpec(cluster_spec)
# Get the role type and id; here job_name is "worker" and task_id is 0
task = tf_config.get("task", {})
job_name = task["type"]
task_id = task["index"]
# Create the TensorFlow training server object
server_def = tf.train.ServerDef(
    cluster=cluster_spec_object.as_cluster_def(),
    protocol="grpc",
    job_name=job_name,
    task_index=task_id)
server = tf.train.Server(server_def)
# If job_name is ps, call server.join()
if job_name == 'ps':
    server.join()
# Check whether the current process is the master; if so, it is responsible
# for creating the session and saving summaries.
is_chief = (job_name == 'master')
# Typical distributed training examples have only the ps and worker roles,
# while TFJob adds the master role, which does not exist in the distributed
# TensorFlow programming model. Code run under TFJob has to handle it, but the
# handling is simple: treat master as just another worker_device type.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:{0}/task:{1}".format(job_name, task_id),
        cluster=cluster_spec)):

See the sample code for the full program.
2. In this example, we will demonstrate how to use TFJob to run distributed training, save the training results and logs to NAS storage, and finally read the training logs through TensorBoard.
2.1 Create a NAS volume and set up a mount point in the same VPC as the current Kubernetes cluster. See the documentation for details.
2.2 Create a /training data folder on the NAS and download the data needed for mnist training:
mkdir -p /nfs
mount -t nfs -o vers=4.0 xxxxxxx.cn-hangzhou.nas.aliyuncs.com:/ /nfs
mkdir -p /nfs/training
umount /nfs
2.3 Create the PV for the NAS; nas-dist-pv.yaml below is an example:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kubeflow-dist-nas-mnist
  labels:
    tfjob: kubeflow-dist-nas-mnist
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  storageClassName: nas
  flexVolume:
    driver: "alicloud/nas"
    options:
      mode: "755"
      path: /training
      server: xxxxxxx.cn-hangzhou.nas.aliyuncs.com
      vers: "4.0"
Save this template as nas-dist-pv.yaml and create the PV:
# kubectl create -f nas-dist-pv.yaml
persistentvolume "kubeflow-dist-nas-mnist" created
2.4 Create the PVC with nas-dist-pvc.yaml:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: kubeflow-dist-nas-mnist
spec:
  storageClassName: nas
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  selector:
    matchLabels:
      tfjob: kubeflow-dist-nas-mnist
The command:
# kubectl create -f nas-dist-pvc.yaml
persistentvolumeclaim "kubeflow-dist-nas-mnist" created
2.5 Create the TFJob
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: mnist-simple-gpu-dist
spec:
  replicaSpecs:
    - replicas: 1 # 1 Master
      tfReplicaType: MASTER
      template:
        spec:
          containers:
            - image: registry.aliyuncs.com/tensorflow-samples/tf-mnist-distributed:gpu
              name: tensorflow
              env:
              - name: TEST_TMPDIR
                value: /training
              command: ["python", "/app/main.py"]
              resources:
                limits:
                  nvidia.com/gpu: 1
              volumeMounts:
              - name: kubeflow-dist-nas-mnist
                mountPath: "/training"
          volumes:
            - name: kubeflow-dist-nas-mnist
              persistentVolumeClaim:
                claimName: kubeflow-dist-nas-mnist
          restartPolicy: OnFailure
    - replicas: 1 # 1 or 2 Workers depends on how many gpus you have
      tfReplicaType: WORKER
      template:
        spec:
          containers:
          - image: registry.aliyuncs.com/tensorflow-samples/tf-mnist-distributed:gpu
            name: tensorflow
            env:
            - name: TEST_TMPDIR
              value: /training
            command: ["python", "/app/main.py"]
            imagePullPolicy: Always
            resources:
              limits:
                nvidia.com/gpu: 1
            volumeMounts:
            - name: kubeflow-dist-nas-mnist
              mountPath: "/training"
          volumes:
            - name: kubeflow-dist-nas-mnist
              persistentVolumeClaim:
                claimName: kubeflow-dist-nas-mnist
          restartPolicy: OnFailure
    - replicas: 1 # 1 Parameter server
      tfReplicaType: PS
      template:
        spec:
          containers:
          - image: registry.aliyuncs.com/tensorflow-samples/tf-mnist-distributed:cpu
            name: tensorflow
            command: ["python", "/app/main.py"]
            env:
            - name: TEST_TMPDIR
              value: /training
            imagePullPolicy: Always
            volumeMounts:
            - name: kubeflow-dist-nas-mnist
              mountPath: "/training"
          volumes:
            - name: kubeflow-dist-nas-mnist
              persistentVolumeClaim:
                claimName: kubeflow-dist-nas-mnist
          restartPolicy: OnFailure
Save this template as mnist-simple-gpu-dist.yaml and create the distributed training TFJob:
# kubectl create -f mnist-simple-gpu-dist.yaml
tfjob "mnist-simple-gpu-dist" created
Check all the running Pods:
# RUNTIMEID=$(kubectl get tfjob mnist-simple-gpu-dist -o=jsonpath='{.spec.RuntimeId}')
# kubectl get po -lruntime_id=$RUNTIMEID
NAME                                        READY     STATUS    RESTARTS   AGE
mnist-simple-gpu-dist-master-z5z4-0-ipy0s   1/1       Running   0          31s
mnist-simple-gpu-dist-ps-z5z4-0-3nzpa       1/1       Running   0          31s
mnist-simple-gpu-dist-worker-z5z4-0-zm0zm   1/1       Running   0          31s

Check the master's log to see that the ClusterSpec has been built successfully:
# kubectl logs -l runtime_id=$RUNTIMEID,job_type=MASTER
2018-06-10 09:31:55.342689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:00:08.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-06-10 09:31:55.342724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:08.0, compute capability: 6.0)
2018-06-10 09:31:55.805747: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> localhost:2222}
2018-06-10 09:31:55.805786: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> mnist-simple-gpu-dist-ps-m5yi-0:2222}
2018-06-10 09:31:55.805794: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> mnist-simple-gpu-dist-worker-m5yi-0:2222}
2018-06-10 09:31:55.807119: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
...
Accuracy at step 900: 0.9709
Accuracy at step 910: 0.971
Accuracy at step 920: 0.9735
Accuracy at step 930: 0.9716
Accuracy at step 940: 0.972
Accuracy at step 950: 0.9697
Accuracy at step 960: 0.9718
Accuracy at step 970: 0.9738
Accuracy at step 980: 0.9725
Accuracy at step 990: 0.9724
Adding run metadata for 999

2.6 Deploy TensorBoard and view the training results
To make TensorFlow programs easier to understand, debug, and optimize, TensorBoard can be used to observe the training process and understand the training framework and optimization algorithms. TensorBoard obtains runtime information by reading TensorFlow's event logs.
The earlier distributed training example already recorded event logs, saved in files named events.out.tfevents*:
# tree
.
└── tensorflow
    ├── input_data
    │   ├── t10k-images-idx3-ubyte.gz
    │   ├── t10k-labels-idx1-ubyte.gz
    │   ├── train-images-idx3-ubyte.gz
    │   └── train-labels-idx1-ubyte.gz
    └── logs
        ├── checkpoint
        ├── events.out.tfevents.1528760350.mnist-simple-gpu-dist-master-fziz-0-74je9
        ├── graph.pbtxt
        ├── model.ckpt-0.data-00000-of-00001
        ├── model.ckpt-0.index
        ├── model.ckpt-0.meta
        ├── test
        │   ├── events.out.tfevents.1528760351.mnist-simple-gpu-dist-master-fziz-0-74je9
        │   └── events.out.tfevents.1528760356.mnist-simple-gpu-dist-worker-fziz-0-9mvsd
        └── train
            ├── events.out.tfevents.1528760350.mnist-simple-gpu-dist-master-fziz-0-74je9
            └── events.out.tfevents.1528760355.mnist-simple-gpu-dist-worker-fziz-0-9mvsd

5 directories, 14 files
Deploy TensorBoard on Kubernetes, pointing it at the NAS storage used for training:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: tensorboard
  name: tensorboard
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tensorboard
  template:
    metadata:
      labels:
        app: tensorboard
    spec:
      volumes:
      - name: kubeflow-dist-nas-mnist
        persistentVolumeClaim:
          claimName: kubeflow-dist-nas-mnist
      containers:
      - name: tensorboard
        image: tensorflow/tensorflow:1.7.0
        imagePullPolicy: Always
        command:
        - /usr/local/bin/tensorboard
        args:
        - --logdir
        - /training/tensorflow/logs
        volumeMounts:
        - name: kubeflow-dist-nas-mnist
          mountPath: "/training"
        ports:
        - containerPort: 6006
          protocol: TCP
      dnsPolicy: ClusterFirst
      restartPolicy: Always
Save this template as tensorboard.yaml and create the TensorBoard deployment:
# kubectl create -f tensorboard.yaml
deployment "tensorboard" created
Once TensorBoard has been created successfully, access it with the kubectl port-forward command:
PODNAME=$(kubectl get pod -l app=tensorboard -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward ${PODNAME} 6006:6006

Open http://127.0.0.1:6006 to log in to TensorBoard and inspect the distributed training model and its results:
Summary
tf-operator solves the orchestration problem of distributed training and simplifies the distributed training work of data scientists. Combined with TensorBoard for inspecting training results, and NAS or OSS for storing data and models, this both enables effective reuse of training data and preservation of experiment results, and lays the groundwork for publishing the model for prediction. Chaining model training, validation, and prediction into a machine-learning workflow is also the core value of Kubeflow, and we will cover it in later articles.
This article is original content from the Yunqi Community and may not be reproduced without permission.