Training Jobs Overview #

Overview #

The Kubeflow Training Operator runs distributed machine learning training jobs on Kubernetes and supports all of the major deep learning frameworks.

Core Features #

text
┌─────────────────────────────────────────────────────────────┐
│                 Training Operator Features                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Distributed training:                                      │
│  ├── Multi-node, multi-GPU training                         │
│  ├── Data parallelism                                       │
│  ├── Model parallelism                                      │
│  └── Hybrid parallelism                                     │
│                                                             │
│  Framework support:                                         │
│  ├── TensorFlow (TFJob)                                     │
│  ├── PyTorch (PyTorchJob)                                   │
│  ├── MPI/Horovod (MPIJob)                                   │
│  ├── XGBoost (XGBoostJob)                                   │
│  └── MXNet (MXJob)                                          │
│                                                             │
│  Resource management:                                       │
│  ├── GPU scheduling                                         │
│  ├── Resource quotas                                        │
│  ├── Failure recovery                                       │
│  └── Automatic restart                                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Core Concepts #

Training Operator #

The Training Operator is a core Kubeflow component that manages the CRDs (custom resources) for each type of training job.

text
Training Operator architecture:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │              Training Operator                       │   │
│  │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │   │
│  │ │  TFJob   │ │ PyTorch  │ │  MPIJob  │ │ XGBoost  │ │   │
│  │ │Controller│ │Controller│ │Controller│ │Controller│ │   │
│  │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │   │
│  └─────────────────────────────────────────────────────┘   │
│                          │                                  │
│                          ▼                                  │
│  ┌─────────────────────────────────────────────────────┐   │
│  │               Kubernetes API Server                  │   │
│  └─────────────────────────────────────────────────────┘   │
│                          │                                  │
│                          ▼                                  │
│  ┌─────────────────────────────────────────────────────┐   │
│  │                   Training Pods                      │   │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐               │   │
│  │  │ Master  │ │ Worker  │ │ Worker  │               │   │
│  │  │  Pod    │ │  Pod 1  │ │  Pod 2  │               │   │
│  │  └─────────┘ └─────────┘ └─────────┘               │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Distributed Training Modes #

text
┌─────────────────────────────────────────────────────────────┐
│                 Distributed Training Modes                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Parameter Server mode:                                     │
│  ┌─────────┐     ┌─────────┐     ┌─────────┐              │
│  │  PS 1   │     │  PS 2   │     │  PS 3   │              │
│  └────┬────┘     └────┬────┘     └────┬────┘              │
│       │               │               │                    │
│       └───────────────┼───────────────┘                    │
│                       │                                    │
│       ┌───────────────┼───────────────┐                    │
│       │               │               │                    │
│  ┌────┴────┐     ┌────┴────┐     ┌────┴────┐              │
│  │Worker 1 │     │Worker 2 │     │Worker 3 │              │
│  └─────────┘     └─────────┘     └─────────┘              │
│                                                             │
│  AllReduce mode:                                            │
│  ┌─────────┐     ┌─────────┐     ┌─────────┐              │
│  │Worker 1 │←───→│Worker 2 │←───→│Worker 3 │              │
│  └─────────┘     └─────────┘     └─────────┘              │
│       ↑               ↑               ↑                    │
│       └───────────────┴───────────────┘                    │
│                  Ring communication                         │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Job Roles #

text
Training job roles:

Master/Chief:
├── Coordinates the training process
├── Saves checkpoints
├── Writes logs
└── Primary node

Worker:
├── Runs the training computation
├── Processes data batches
├── Updates model parameters
└── Worker node

Parameter Server (PS):
├── Stores model parameters
├── Aggregates parameters
├── Distributes parameters
└── Parameter server node

Evaluator:
├── Evaluates the model
├── Computes metrics
└── Validation node

Supported Training Job Types #

TFJob #

yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tensorflow-job
  namespace: kubeflow-user-example-com
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.12.0
    Worker:
      replicas: 3
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.12.0

PyTorchJob #

yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-job
  namespace: kubeflow-user-example-com
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.0.0
    Worker:
      replicas: 3
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.0.0

MPIJob #

yaml
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: mpi-job
  namespace: kubeflow-user-example-com
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: mpi-launcher
            image: mpioperator/mpi-operator:latest  # replace with an image that bundles mpirun and your training script
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - name: mpi-worker
            image: mpioperator/mpi-operator:latest  # replace with an image containing your training code

XGBoostJob #

yaml
apiVersion: kubeflow.org/v1
kind: XGBoostJob
metadata:
  name: xgboost-job
  namespace: kubeflow-user-example-com
spec:
  xgbReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: xgboost
            image: python:3.9  # placeholder; use an image with xgboost installed
    Worker:
      replicas: 3
      template:
        spec:
          containers:
          - name: xgboost
            image: python:3.9  # placeholder; use an image with xgboost installed

Common Configuration #

Resource Configuration #

yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: resource-config-job
  namespace: kubeflow-user-example-com
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.12.0
            resources:
              requests:
                cpu: "4"
                memory: "16Gi"
                nvidia.com/gpu: "1"  # for extended resources, requests must equal limits
              limits:
                cpu: "8"
                memory: "32Gi"
                nvidia.com/gpu: "1"

Storage Configuration #

yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: storage-config-job
  namespace: kubeflow-user-example-com
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.0.0
            volumeMounts:
            - name: data
              mountPath: /data
            - name: output
              mountPath: /output
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: training-data-pvc
          - name: output
            persistentVolumeClaim:
              claimName: model-output-pvc

Environment Variable Configuration #

yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: env-config-job
  namespace: kubeflow-user-example-com
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.12.0
            env:
            # TF_CONFIG is injected into every TFJob container automatically
            # by the Training Operator, so it does not need to be set manually.
            - name: DATA_PATH
              value: "/data"
            - name: MODEL_PATH
              value: "/output"

Scheduling Configuration #

yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: scheduling-job
  namespace: kubeflow-user-example-com
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: accelerator
                    operator: In
                    values:
                    - nvidia-tesla-v100
          tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.0.0
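
When several multi-pod jobs compete for GPUs, partially scheduled jobs can deadlock the cluster. If the Training Operator is configured with a gang scheduler (e.g. Volcano or the scheduler-plugins coscheduling plugin), `runPolicy.schedulingPolicy` asks the scheduler to place all replicas together; a sketch, where the queue name is illustrative:

yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: gang-scheduled-job         # illustrative name
  namespace: kubeflow-user-example-com
spec:
  runPolicy:
    schedulingPolicy:
      minAvailable: 5              # schedule only when all 5 pods (1 master + 4 workers) fit
      queue: default               # gang-scheduler queue name (illustrative)
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.0.0
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.0.0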

Run Policy #

Cleanup Policy #

yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: clean-policy-job
  namespace: kubeflow-user-example-com
spec:
  runPolicy:
    cleanPodPolicy: None  # Running, All, None
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.12.0

Timeout Configuration #

yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: timeout-job
  namespace: kubeflow-user-example-com
spec:
  runPolicy:
    activeDeadlineSeconds: 3600  # 1-hour timeout
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.0.0
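
Besides a deadline, `runPolicy.ttlSecondsAfterFinished` garbage-collects the job object some time after it reaches a terminal state, which keeps namespaces tidy. A sketch combining both (the job name is illustrative):

yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: ttl-job                    # illustrative name
  namespace: kubeflow-user-example-com
spec:
  runPolicy:
    activeDeadlineSeconds: 3600    # fail the job if it runs longer than 1 hour
    ttlSecondsAfterFinished: 600   # delete the job object 10 minutes after it finishes
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.0.0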

Restart Policy #

yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: restart-policy-job
  namespace: kubeflow-user-example-com
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure  # OnFailure, Never, ExitCode
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.12.0
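
`restartPolicy: ExitCode` restarts a replica only when its container exits with a retryable exit code, and `runPolicy.backoffLimit` caps the total number of restarts before the job is marked failed. A sketch combining the two (the job name is illustrative):

yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: backoff-job                # illustrative name
  namespace: kubeflow-user-example-com
spec:
  runPolicy:
    backoffLimit: 3                # mark the job failed after 3 restarts
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: ExitCode      # restart only on retryable exit codes
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.12.0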

Managing Training Jobs #

Checking Job Status #

bash
# List all TFJobs
kubectl get tfjobs -n kubeflow-user-example-com

# List all PyTorchJobs
kubectl get pytorchjobs -n kubeflow-user-example-com

# Show job details
kubectl describe tfjob tensorflow-job -n kubeflow-user-example-com

# Show job status conditions
kubectl get tfjob tensorflow-job -n kubeflow-user-example-com -o jsonpath='{.status.conditions}'

Viewing Training Logs #

bash
# View Chief/Master logs
kubectl logs tensorflow-job-chief-0 -n kubeflow-user-example-com

# View Worker logs
kubectl logs tensorflow-job-worker-0 -n kubeflow-user-example-com

# Follow logs in real time
kubectl logs -f tensorflow-job-chief-0 -n kubeflow-user-example-com

# View logs from all of the job's pods
kubectl logs -l training.kubeflow.org/job-name=tensorflow-job -n kubeflow-user-example-com

Stopping and Deleting Jobs #

bash
# Stop a job by deleting it
kubectl delete tfjob tensorflow-job -n kubeflow-user-example-com

# Stop a PyTorchJob
kubectl delete pytorchjob pytorch-job -n kubeflow-user-example-com

# Force delete
kubectl delete tfjob tensorflow-job -n kubeflow-user-example-com --force --grace-period=0
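
Deleting a job discards it entirely. Newer Training Operator releases (roughly v1.7 and later) also support pausing a job via `runPolicy.suspend`: the training pods are torn down, but the job object remains and can be resumed later. A sketch of the relevant fragment:

yaml
# Fragment only: merge into an existing job's spec. Setting suspend
# back to false resumes the job (assumes a Training Operator release
# with suspend support, roughly v1.7+).
spec:
  runPolicy:
    suspend: true

This can be applied in place, e.g. with `kubectl patch tfjob tensorflow-job -n kubeflow-user-example-com --type merge -p '{"spec":{"runPolicy":{"suspend":true}}}'`.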

GPU Training #

GPU Resource Requests #

yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: gpu-training-job
  namespace: kubeflow-user-example-com
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.12.0-gpu
            resources:
              limits:
                nvidia.com/gpu: 2  # 2 GPUs per Worker

Multi-Node Multi-GPU #

yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: multi-node-gpu-job
  namespace: kubeflow-user-example-com
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
            resources:
              limits:
                nvidia.com/gpu: 4
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
            resources:
              limits:
                nvidia.com/gpu: 4
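
For multi-node GPU jobs that communicate over NCCL, it is common to set NCCL's own environment variables on each replica, e.g. for debugging or to pin the network interface. A sketch of a worker container fragment (`eth0` is illustrative; the right interface depends on the cluster network):

yaml
# Fragment of a replica's pod template; NCCL_DEBUG and NCCL_SOCKET_IFNAME
# are standard NCCL environment variables, not Training Operator fields.
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
            env:
            - name: NCCL_DEBUG
              value: "INFO"        # log NCCL initialization and ring setup
            - name: NCCL_SOCKET_IFNAME
              value: "eth0"        # NIC for inter-node traffic (illustrative)
            resources:
              limits:
                nvidia.com/gpu: 4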

Best Practices #

Training Configuration #

text
1. Resource planning
   ├── Size the number of Workers appropriately
   ├── Configure memory based on model size
   └── Allocate GPU resources sensibly

2. Data management
   ├── Use shared storage
   ├── Finish data preprocessing ahead of time
   └── Optimize data loading

3. Checkpoint management
   ├── Save checkpoints regularly
   ├── Save to persistent storage
   └── Support resuming from checkpoints

Failure Handling #

text
1. Automatic recovery
   ├── Configure restart policies
   ├── Restore from checkpoints
   └── Record error logs

2. Monitoring and alerting
   ├── Monitor training progress
   ├── Monitor resource usage
   └── Alert on anomalies

3. Debugging tips
   ├── Inspect logs
   ├── Exec into containers to debug
   └── Check events

Next Steps #

Now that you understand the basics of training jobs, continue with TensorFlow Training to learn TFJob configuration and usage in depth!

Last updated: 2026-04-05