Train a model with GPUs in GKE Standard mode

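This quickstart tutorial shows you how to deploy a training model that uses GPUs in Google Kubernetes Engine (GKE) and store the predictions in Cloud Storage. The tutorial uses a TensorFlow model and a GKE Standard cluster. You can also run these workloads on an Autopilot cluster, which requires fewer setup steps. For instructions, see "Train a model with GPUs in GKE Autopilot mode".

This document is intended for GKE administrators who have existing Standard clusters and want to run GPU workloads for the first time.

Before you begin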

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Verify that billing is enabled for your Google Cloud project.

Enable the Kubernetes Engine and Cloud Storage APIs.

In the Google Cloud console, activate Cloud Shell.

At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

Clone the sample repository

In Cloud Shell, run the following commands:

git clone https://github.com/GoogleCloudPlatform/ai-on-gke/ ai-on-gke
cd ai-on-gke/tutorials-and-examples/gpu-examples/training-single-gpu

Create a Standard mode cluster and a GPU node pool

Use Cloud Shell to do the following:

Create a Standard cluster that uses Workload Identity Federation for GKE and installs the Cloud Storage FUSE driver:

gcloud container clusters create gke-gpu-cluster \
    --addons GcsFuseCsiDriver \
    --location=us-central1 \
    --num-nodes=1 \
    --workload-pool=PROJECT_ID.svc.id.goog

Replace PROJECT_ID with your Google Cloud project ID.

Cluster creation might take several minutes.

Create a GPU node pool:

gcloud container node-pools create gke-gpu-pool-1 \
    --accelerator=type=nvidia-tesla-t4,count=1,gpu-driver-version=default \
    --machine-type=n1-standard-16 --num-nodes=1 \
    --location=us-central1 \
    --cluster=gke-gpu-cluster

Create a Cloud Storage bucket

In the Google Cloud console, go to the Create a bucket page.
In the Name your bucket field, enter the following name:
```
PROJECT_ID-gke-gpu-bucket
```
Click Continue.
For Location type, select Region.
In the Region list, select us-central1 (Iowa), and then click Continue.
In the Choose a storage class for your data section, click Continue.
In the Choose how to control access to objects section, for Access control, select Uniform.
Click Create.
In the Public access will be prevented dialog, make sure that the Enforce public access prevention on this bucket checkbox is selected, and then click Confirm.
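
If you prefer the command line to the console, a roughly equivalent bucket can likely be created with the gcloud CLI. The following is a minimal sketch, not part of the original walkthrough: `tutorial_bucket_name` and `create_tutorial_bucket` are hypothetical helper names, and the flags assume a recent gcloud CLI release with the `gcloud storage` command group.

```shell
# Hypothetical helpers (not part of the sample repository).

# Print the bucket name that this tutorial uses for a given project ID.
tutorial_bucket_name() {
  printf '%s-gke-gpu-bucket\n' "$1"
}

# Create a regional bucket in us-central1 with uniform bucket-level access
# and public access prevention enforced, mirroring the console steps.
# Argument: the project ID.
create_tutorial_bucket() {
  gcloud storage buckets create "gs://$(tutorial_bucket_name "$1")" \
    --location=us-central1 \
    --uniform-bucket-level-access \
    --public-access-prevention
}
```

Calling `create_tutorial_bucket PROJECT_ID` would then target `gs://PROJECT_ID-gke-gpu-bucket`, the same name that the later commands in this tutorial expect.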

Configure your cluster to access the bucket by using Workload Identity Federation for GKE

To let your cluster access the Cloud Storage bucket, do the following:

Create a Google Cloud service account.
Create a Kubernetes ServiceAccount in your cluster.
Bind the Kubernetes ServiceAccount to the Google Cloud service account.

Create a Google Cloud service account

In the Google Cloud console, go to the Create service account page.
In the Service account ID field, enter gke-ai-sa.
Click Create and continue.
In the Role list, select the Cloud Storage > Storage Insights Collector Service role.
Click Add another role.
In the Select a role list, select the Cloud Storage > Storage Object Admin role.
Click Continue, and then click Done.
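
The console steps above can also be approximated from Cloud Shell. This is a hedged sketch rather than the documented procedure: `tutorial_gsa_email` and `create_tutorial_gsa` are hypothetical helpers, and the role IDs `roles/storage.insightsCollectorService` and `roles/storage.objectAdmin` are my mapping of the two role names selected in the console.

```shell
# Hypothetical helpers (not part of the sample repository).

# Print the email address of the tutorial's service account in a project.
tutorial_gsa_email() {
  printf 'gke-ai-sa@%s.iam.gserviceaccount.com\n' "$1"
}

# Create the service account, then grant it the two Cloud Storage roles
# at the project level. Argument: the project ID.
create_tutorial_gsa() {
  project_id="$1"
  gcloud iam service-accounts create gke-ai-sa --project="$project_id"
  for role in roles/storage.insightsCollectorService roles/storage.objectAdmin; do
    gcloud projects add-iam-policy-binding "$project_id" \
      --member="serviceAccount:$(tutorial_gsa_email "$project_id")" \
      --role="$role"
  done
}
```

Keeping the email address in one helper avoids retyping it in the binding and annotation commands that follow.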

Create a Kubernetes ServiceAccount in your cluster

In Cloud Shell, do the following:

Create a Kubernetes namespace:

kubectl create namespace gke-ai-namespace

Create a Kubernetes ServiceAccount in the namespace:

kubectl create serviceaccount gpu-k8s-sa --namespace=gke-ai-namespace

Bind the Kubernetes ServiceAccount to the Google Cloud service account

In Cloud Shell, run the following commands:

Add an IAM binding to the Google Cloud service account:

gcloud iam service-accounts add-iam-policy-binding gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:PROJECT_ID.svc.id.goog[gke-ai-namespace/gpu-k8s-sa]"

The --member flag provides the full identity of the Kubernetes ServiceAccount in Google Cloud.

Annotate the Kubernetes ServiceAccount:

kubectl annotate serviceaccount gpu-k8s-sa \
    --namespace gke-ai-namespace \
    iam.gke.io/gcp-service-account=gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com
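
The `--member` value used in the binding follows the pattern `serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]`. As a small illustration, the hypothetical helper `wi_member` below (not part of the sample repository) assembles that string from its three parts, which can help avoid quoting mistakes around the square brackets.

```shell
# Hypothetical helper: build the Workload Identity Federation for GKE
# member string for a Kubernetes ServiceAccount.
# Arguments: project ID, namespace, Kubernetes ServiceAccount name.
wi_member() {
  printf 'serviceAccount:%s.svc.id.goog[%s/%s]\n' "$1" "$2" "$3"
}

# Example call with this tutorial's names; PROJECT_ID stays a placeholder.
wi_member PROJECT_ID gke-ai-namespace gpu-k8s-sa
```

To review the result of both commands, you could inspect the policy with `gcloud iam service-accounts get-iam-policy gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com` and the annotation with `kubectl get serviceaccount gpu-k8s-sa -n gke-ai-namespace -o yaml`.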

Verify that Pods can access the Cloud Storage bucket

In Cloud Shell, create the following environment variables:
```
export K8S_SA_NAME=gpu-k8s-sa
export BUCKET_NAME=PROJECT_ID-gke-gpu-bucket
```
Replace PROJECT_ID with your Google Cloud project ID.
Create a Pod that has a TensorFlow container:
```
envsubst < src/gke-config/standard-tensorflow-bash.yaml | kubectl --namespace=gke-ai-namespace apply -f -
```
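This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.

Create a sample file in the bucket: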

touch sample-file
gcloud storage cp sample-file gs://PROJECT_ID-gke-gpu-bucket

Wait for the Pod to become ready:

kubectl wait --for=condition=Ready pod/test-tensorflow-pod -n=gke-ai-namespace --timeout=180s

When the Pod is ready, the output is the following:

pod/test-tensorflow-pod condition met

Open a shell in the TensorFlow container:

kubectl -n gke-ai-namespace exec --stdin --tty test-tensorflow-pod --container tensorflow -- /bin/bash

Try to read the sample file that you created:
```
ls /data
```
The output shows the sample file.

Identify the GPU that's attached to the Pod:

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

The output shows the GPU that's attached to the Pod, similar to the following:

...
PhysicalDevice(name='/physical_device:GPU:0',device_type='GPU')

Exit the container:
```
exit
```

Delete the sample Pod:

kubectl delete -f src/gke-config/standard-tensorflow-bash.yaml \
    --namespace=gke-ai-namespace
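
One caveat about the `envsubst` step used in this section: `envsubst` replaces references to unset variables with an empty string, so a missing `K8S_SA_NAME` or `BUCKET_NAME` would yield a manifest with blank fields rather than an error. A guard such as the following sketch can catch that before a manifest is applied. `require_vars` is a hypothetical helper, not part of the sample repository.

```shell
# Hypothetical helper: return non-zero if any named variable is unset or
# empty, and report each missing name on stderr.
require_vars() {
  missing=0
  for name in "$@"; do
    # Look up the variable whose name is stored in $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (requires envsubst and kubectl, so it is left commented out here):
# require_vars K8S_SA_NAME BUCKET_NAME && \
#   envsubst < src/gke-config/standard-tensorflow-bash.yaml | \
#   kubectl --namespace=gke-ai-namespace apply -f -
```

The same guard applies to the training and inference manifests, which reference the same two variables.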

Train and predict using the `MNIST` dataset

In this section, you run a training workload on the MNIST example dataset.

Copy the example data to the Cloud Storage bucket:

gcloud storage cp src/tensorflow-mnist-example gs://PROJECT_ID-gke-gpu-bucket/ --recursive

Create the following environment variables:

export K8S_SA_NAME=gpu-k8s-sa
export BUCKET_NAME=PROJECT_ID-gke-gpu-bucket

Review the training Job:

# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-training-job
spec:
  template:
    metadata:
      name: mnist
      annotations:
        gke-gcsfuse/volumes: "true"
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:latest-gpu 
        command: ["/bin/bash", "-c", "--"]
        args: ["cd /data/tensorflow-mnist-example; pip install -r requirements.txt; python tensorflow_mnist_train_distributed.py"]
        resources:
          limits:
            nvidia.com/gpu: 1
            cpu: 1
            memory: 3Gi
        volumeMounts:
        - name: gcs-fuse-csi-vol
          mountPath: /data
          readOnly: false
      serviceAccountName: $K8S_SA_NAME
      volumes:
      - name: gcs-fuse-csi-vol
        csi:
          driver: gcsfuse.csi.storage.gke.io
          readOnly: false
          volumeAttributes:
            bucketName: $BUCKET_NAME
            mountOptions: "implicit-dirs"
      restartPolicy: "Never"

Deploy the training Job:
```
envsubst < src/gke-config/standard-tf-mnist-train.yaml | kubectl -n gke-ai-namespace apply -f -
```
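This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.

Wait until the Job has the Completed status: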

kubectl wait -n gke-ai-namespace --for=condition=Complete job/mnist-training-job --timeout=180s

The output is similar to the following:

job.batch/mnist-training-job condition met

Check the logs from the TensorFlow container:

kubectl logs -f jobs/mnist-training-job -c tensorflow -n gke-ai-namespace

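The output shows the following events:

Install the required Python packages
Download the MNIST dataset
Train the model by using a GPU
Save the model
Evaluate the model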

...
Epoch 12/12
927/938 [============================>.] - ETA: 0s - loss: 0.0188 - accuracy: 0.9954
Learning rate for epoch 12 is 9.999999747378752e-06
938/938 [==============================] - 5s 6ms/step - loss: 0.0187 - accuracy: 0.9954 - lr: 1.0000e-05
157/157 [==============================] - 1s 4ms/step - loss: 0.0424 - accuracy: 0.9861
Eval loss: 0.04236088693141937, Eval accuracy: 0.9861000180244446
Training finished. Model saved

Delete the training workload:

kubectl -n gke-ai-namespace delete -f src/gke-config/standard-tf-mnist-train.yaml

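Deploy an inference workload

In this section, you deploy an inference workload that takes a sample dataset as input and returns predictions.

Copy the images that are used for prediction to the bucket: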

gcloud storage cp data/mnist_predict gs://PROJECT_ID-gke-gpu-bucket/ --recursive

Review the inference workload:

# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-batch-prediction-job
spec:
  template:
    metadata:
      name: mnist
      annotations:
        gke-gcsfuse/volumes: "true"
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:latest-gpu 
        command: ["/bin/bash", "-c", "--"]
        args: ["cd /data/tensorflow-mnist-example; pip install -r requirements.txt; python tensorflow_mnist_batch_predict.py"]
        resources:
          limits:
            nvidia.com/gpu: 1
            cpu: 1
            memory: 3Gi
        volumeMounts:
        - name: gcs-fuse-csi-vol
          mountPath: /data
          readOnly: false
      serviceAccountName: $K8S_SA_NAME
      volumes:
      - name: gcs-fuse-csi-vol
        csi:
          driver: gcsfuse.csi.storage.gke.io
          readOnly: false
          volumeAttributes:
            bucketName: $BUCKET_NAME
            mountOptions: "implicit-dirs"
      restartPolicy: "Never"

Deploy the inference workload:
```
envsubst < src/gke-config/standard-tf-mnist-batch-predict.yaml | kubectl -n gke-ai-namespace apply -f -
```
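This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.

Wait until the Job has the Completed status: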

kubectl wait -n gke-ai-namespace --for=condition=Complete job/mnist-batch-prediction-job --timeout=180s

The output is similar to the following:

job.batch/mnist-batch-prediction-job condition met

Check the logs from the TensorFlow container:

kubectl logs -f jobs/mnist-batch-prediction-job -c tensorflow -n gke-ai-namespace

The output shows the prediction for each image and the model's confidence in that prediction, similar to the following:

Found 10 files belonging to 1 classes.
1/1 [==============================] - 2s 2s/step
The image /data/mnist_predict/0.png is the number 0 with a 100.00 percent confidence.
The image /data/mnist_predict/1.png is the number 1 with a 99.99 percent confidence.
The image /data/mnist_predict/2.png is the number 2 with a 100.00 percent confidence.
The image /data/mnist_predict/3.png is the number 3 with a 99.95 percent confidence.
The image /data/mnist_predict/4.png is the number 4 with a 100.00 percent confidence.
The image /data/mnist_predict/5.png is the number 5 with a 100.00 percent confidence.
The image /data/mnist_predict/6.png is the number 6 with a 99.97 percent confidence.
The image /data/mnist_predict/7.png is the number 7 with a 100.00 percent confidence.
The image /data/mnist_predict/8.png is the number 8 with a 100.00 percent confidence.
The image /data/mnist_predict/9.png is the number 9 with a 99.65 percent confidence.
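
To spot less certain predictions without reading every line, the log output can be filtered by confidence. The following sketch assumes the exact log format shown above. `low_confidence` is a hypothetical helper, and the 99.9 threshold is an arbitrary value chosen for illustration.

```shell
# Hypothetical helper: read prediction log lines on stdin and print those
# whose confidence is below the threshold passed as the first argument.
low_confidence() {
  awk -v limit="$1" '
    /percent confidence/ {
      # The confidence value is the field just before the word "percent".
      for (i = 2; i <= NF; i++) {
        if ($i == "percent" && ($(i - 1) + 0) < (limit + 0)) {
          print
        }
      }
    }'
}

# Example with two lines copied from the output above; only the 9.png line
# has a confidence below 99.9, so only that line is printed.
printf '%s\n' \
  'The image /data/mnist_predict/8.png is the number 8 with a 100.00 percent confidence.' \
  'The image /data/mnist_predict/9.png is the number 9 with a 99.65 percent confidence.' |
  low_confidence 99.9
```

Against a live cluster, the same filter could be fed from `kubectl logs jobs/mnist-batch-prediction-job -c tensorflow -n gke-ai-namespace`.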

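Clean up

To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, do one of the following:

Keep the GKE cluster: Delete the Kubernetes resources in the cluster and the Google Cloud resources
Keep the project: Delete the GKE cluster and the Google Cloud resources
Delete the project

Delete the Kubernetes resources in the cluster and the Google Cloud resources

Delete the Kubernetes namespace and the workloads that you deployed: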

kubectl -n gke-ai-namespace delete -f src/gke-config/standard-tf-mnist-batch-predict.yaml
kubectl delete namespace gke-ai-namespace

Delete the Cloud Storage bucket:
1. In the Google Cloud console, go to the Buckets page.
2. Select the checkbox for PROJECT_ID-gke-gpu-bucket.
3. Click Delete.
4. To confirm deletion, type DELETE, and then click Delete.
Delete the Google Cloud service account:
1. In the Google Cloud console, go to the Service accounts page.
2. Select your project.
3. Select the checkbox for gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com.
4. Click Delete.
5. To confirm deletion, click Delete.

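Delete the GKE cluster and the Google Cloud resources

Delete the GKE cluster:
1. In the Google Cloud console, go to the Clusters page.
2. Select the checkbox for gke-gpu-cluster.
3. Click Delete.
4. To confirm deletion, type gke-gpu-cluster, and then click Delete.
Delete the Cloud Storage bucket:
1. In the Google Cloud console, go to the Buckets page.
2. Select the checkbox for PROJECT_ID-gke-gpu-bucket.
3. Click Delete.
4. To confirm deletion, type DELETE, and then click Delete.
Delete the Google Cloud service account:
1. In the Google Cloud console, go to the Service accounts page.
2. Select your project.
3. Select the checkbox for gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com.
4. Click Delete.
5. To confirm deletion, click Delete.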

Delete the project

In the Google Cloud console, go to the Manage resources page.
Go to Manage resources
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

Learn more about using GPUs in GKE.
