Training Jobs Overview #
Overview #
The Kubeflow Training Operator runs distributed machine-learning training jobs on Kubernetes and supports the major deep-learning frameworks.
Core Features #
text
┌─────────────────────────────────────────────────────────────┐
│                 Training Operator Features                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Distributed training:                                      │
│  ├── Multi-node, multi-GPU training                         │
│  ├── Data parallelism                                       │
│  ├── Model parallelism                                      │
│  └── Hybrid parallelism                                     │
│                                                             │
│  Framework support:                                         │
│  ├── TensorFlow (TFJob)                                     │
│  ├── PyTorch (PyTorchJob)                                   │
│  ├── MPI/Horovod (MPIJob)                                   │
│  ├── XGBoost (XGBoostJob)                                   │
│  └── MXNet (MXJob)                                          │
│                                                             │
│  Resource management:                                       │
│  ├── GPU scheduling                                         │
│  ├── Resource quotas                                        │
│  ├── Failure recovery                                       │
│  └── Automatic restarts                                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Core Concepts #
Training Operator #
The Training Operator is a core Kubeflow component; it manages the CRDs and controllers for every kind of training job.
text
Training Operator architecture:
┌─────────────────────────────────────────────────────────────┐
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Training Operator │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ TFJob │ │PyTorch │ │ MPIJob │ │XGBoost │ │ │
│ │ │Controller│ │Controller│ │Controller│ │Controller│ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Kubernetes API Server │ │
│ └─────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Training Pods │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ Master │ │ Worker │ │ Worker │ │ │
│ │ │ Pod │ │ Pod 1 │ │ Pod 2 │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Distributed Training Modes #
text
┌─────────────────────────────────────────────────────────────┐
│                 Distributed Training Modes                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Parameter Server mode:                                     │
│  ┌─────────┐     ┌─────────┐     ┌─────────┐                │
│  │  PS 1   │     │  PS 2   │     │  PS 3   │                │
│  └────┬────┘     └────┬────┘     └────┬────┘                │
│       │               │               │                     │
│       └───────────────┼───────────────┘                     │
│                       │                                     │
│       ┌───────────────┼───────────────┐                     │
│       │               │               │                     │
│  ┌────┴────┐     ┌────┴────┐     ┌────┴────┐                │
│  │Worker 1 │     │Worker 2 │     │Worker 3 │                │
│  └─────────┘     └─────────┘     └─────────┘                │
│                                                             │
│  AllReduce mode:                                            │
│  ┌─────────┐     ┌─────────┐     ┌─────────┐                │
│  │Worker 1 │←───→│Worker 2 │←───→│Worker 3 │                │
│  └─────────┘     └─────────┘     └─────────┘                │
│       ↑               ↑               ↑                     │
│       └───────────────┴───────────────┘                     │
│                  ring communication                         │
│                                                             │
└─────────────────────────────────────────────────────────────┘
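The AllReduce pattern above can be sketched as a toy ring all-reduce in Python: a simulation of the two phases (reduce-scatter, then all-gather), not a real NCCL or Horovod implementation.

```python
def ring_allreduce(grads):
    """Toy ring all-reduce simulation.

    grads: list of per-worker gradient vectors; each vector has one
    chunk per worker. After 2*(N-1) neighbor exchanges, every worker
    holds the elementwise sum of all workers' gradients.
    """
    n = len(grads)
    data = [list(g) for g in grads]
    # Phase 1: reduce-scatter. At step s, worker i sends chunk (i - s)
    # to its right neighbor, which accumulates it. Sends are snapshotted
    # first so the exchange within a step is simultaneous.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, data[i][(i - step) % n])
                 for i in range(n)]
        for i, c, val in sends:
            data[(i + 1) % n][c] += val
    # Phase 2: all-gather. Circulate the fully reduced chunks around
    # the ring so every worker ends with the complete summed vector.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n])
                 for i in range(n)]
        for i, c, val in sends:
            data[(i + 1) % n][c] = val
    return data
```

Each worker only ever talks to its ring neighbors, which is why the pattern scales without the central bottleneck of a parameter server.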
Job Roles #
text
Training job roles:
Master/Chief:
├── Coordinates the training process
├── Saves checkpoints
├── Writes logs
└── Primary node
Worker:
├── Runs the training computation
├── Processes data batches
├── Updates model parameters
└── Worker node
Parameter Server (PS):
├── Stores model parameters
├── Aggregates parameters
├── Distributes parameters
└── Parameter server node
Evaluator:
├── Evaluates the model
├── Computes metrics
└── Validation node
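In a TFJob, these roles surface in the TF_CONFIG environment variable that the Training Operator injects into every pod. A minimal sketch of reading it (host names here are hypothetical; the JSON layout follows TensorFlow's TF_CONFIG convention):

```python
import json

# TF_CONFIG as injected into each TFJob pod: a cluster map of
# role -> host list, plus this pod's own role and index.
example_tf_config = json.dumps({
    "cluster": {
        "chief": ["tensorflow-job-chief-0:2222"],
        "worker": ["tensorflow-job-worker-0:2222",
                   "tensorflow-job-worker-1:2222"],
        "ps": ["tensorflow-job-ps-0:2222"],
    },
    "task": {"type": "worker", "index": 1},
})

def describe_task(tf_config_json):
    """Return (role, index, world_size) for the current pod."""
    cfg = json.loads(tf_config_json)
    role = cfg["task"]["type"]
    index = cfg["task"]["index"]
    world_size = sum(len(hosts) for hosts in cfg["cluster"].values())
    return role, index, world_size
```

A training script typically branches on the role: the chief saves checkpoints and writes logs, workers just compute.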
Supported Training Job Types #
TFJob #
yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tensorflow-job
  namespace: kubeflow-user-example-com
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.12.0
    Worker:
      replicas: 3
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.12.0
PyTorchJob #
yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-job
  namespace: kubeflow-user-example-com
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.0.0
    Worker:
      replicas: 3
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.0.0
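For a PyTorchJob like this, the operator injects MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE into every replica. A sketch of an entrypoint reading them; a real script would pass these values to torch.distributed.init_process_group, omitted here, and the sample values below are hypothetical:

```python
import os

def dist_env(environ=os.environ):
    """Read the rendezvous variables a PyTorchJob injects into each pod."""
    return {
        "master": f"{environ['MASTER_ADDR']}:{environ['MASTER_PORT']}",
        "rank": int(environ["RANK"]),
        "world_size": int(environ["WORLD_SIZE"]),
    }

# Values as the operator would set them for the last worker of a
# 1-master, 3-worker job (hypothetical job name):
fake_env = {"MASTER_ADDR": "pytorch-job-master-0",
            "MASTER_PORT": "23456",
            "RANK": "3",
            "WORLD_SIZE": "4"}
```

Because the operator handles rendezvous, the container image needs no job-specific configuration: the same image works for master and workers.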
MPIJob #
yaml
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: mpi-job
  namespace: kubeflow-user-example-com
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: mpi-launcher
              # sample MPI training image (pi example from mpi-operator)
              image: mpioperator/mpi-pi:latest
    Worker:
      replicas: 4
      template:
        spec:
          containers:
            - name: mpi-worker
              image: mpioperator/mpi-pi:latest
XGBoostJob #
yaml
apiVersion: kubeflow.org/v1
kind: XGBoostJob
metadata:
  name: xgboost-job
  namespace: kubeflow-user-example-com
spec:
  xgbReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: xgboost
              image: python:3.9
    Worker:
      replicas: 3
      template:
        spec:
          containers:
            - name: xgboost
              image: python:3.9
Common Configuration #
Resource Configuration #
yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: resource-config-job
  namespace: kubeflow-user-example-com
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.12.0
              resources:
                requests:
                  cpu: "4"
                  memory: "16Gi"
                  nvidia.com/gpu: "1"
                limits:
                  cpu: "8"
                  memory: "32Gi"
                  nvidia.com/gpu: "1"
Storage Configuration #
yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: storage-config-job
  namespace: kubeflow-user-example-com
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.0.0
              volumeMounts:
                - name: data
                  mountPath: /data
                - name: output
                  mountPath: /output
          volumes:
            - name: data
              persistentVolumeClaim:
                claimName: training-data-pvc
            - name: output
              persistentVolumeClaim:
                claimName: model-output-pvc
Environment Variable Configuration #
yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: env-config-job
  namespace: kubeflow-user-example-com
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.12.0
              env:
                # TF_CONFIG is injected automatically by the Training
                # Operator and does not need to be set manually.
                - name: DATA_PATH
                  value: "/data"
                - name: MODEL_PATH
                  value: "/output"
Scheduling Configuration #
yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: scheduling-job
  namespace: kubeflow-user-example-com
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: accelerator
                        operator: In
                        values:
                          - nvidia-tesla-v100
          tolerations:
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.0.0
Run Policy #
Clean Pod Policy #
yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: clean-policy-job
  namespace: kubeflow-user-example-com
spec:
  runPolicy:
    cleanPodPolicy: None  # Running, All, or None
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.12.0
Timeout Configuration #
yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: timeout-job
  namespace: kubeflow-user-example-com
spec:
  runPolicy:
    activeDeadlineSeconds: 3600  # 1-hour timeout
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.0.0
Restart Policy #
yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: restart-policy-job
  namespace: kubeflow-user-example-com
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure  # Always, OnFailure, Never, or ExitCode
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.12.0
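With restartPolicy: ExitCode, whether a replica restarts depends on how it exited. A sketch of the commonly documented convention, assuming codes 1-127 are treated as permanent failures and codes 128 and above (signal exits, such as 137 for SIGKILL) as retryable:

```python
def should_restart(exit_code):
    """ExitCode restart-policy sketch.

    0      -> success, no restart
    1-127  -> permanent failure, no restart
    128+   -> signal exit (e.g. 137 = 128 + SIGKILL), retry
    """
    if exit_code == 0:
        return False
    return exit_code >= 128
```

This lets preemptions and OOM kills be retried while genuine program bugs fail the job fast.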
Managing Training Jobs #
Checking Job Status #
bash
# List all TFJobs
kubectl get tfjobs -n kubeflow-user-example-com

# List all PyTorchJobs
kubectl get pytorchjobs -n kubeflow-user-example-com

# Show job details
kubectl describe tfjob tensorflow-job -n kubeflow-user-example-com

# Show the job's status conditions
kubectl get tfjob tensorflow-job -n kubeflow-user-example-com -o jsonpath='{.status.conditions}'
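The jsonpath query above returns the job's condition list as JSON. A small sketch of extracting the current phase from it (the condition shape mirrors what the operator reports; timestamps here are illustrative):

```python
import json

# Shape of .status.conditions as returned by the jsonpath query:
conditions_json = json.dumps([
    {"type": "Created", "status": "True",
     "lastTransitionTime": "2024-01-01T10:00:00Z"},
    {"type": "Running", "status": "True",
     "lastTransitionTime": "2024-01-01T10:01:00Z"},
    {"type": "Succeeded", "status": "True",
     "lastTransitionTime": "2024-01-01T11:30:00Z"},
])

def current_phase(raw):
    """Return the type of the most recent condition with status 'True'."""
    conds = [c for c in json.loads(raw) if c["status"] == "True"]
    conds.sort(key=lambda c: c["lastTransitionTime"])
    return conds[-1]["type"] if conds else "Unknown"
```

The same parsing works for PyTorchJob, MPIJob, and XGBoostJob, since they share the common status schema.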
Viewing Training Logs #
bash
# View Chief/Master logs
kubectl logs tensorflow-job-chief-0 -n kubeflow-user-example-com

# View Worker logs
kubectl logs tensorflow-job-worker-0 -n kubeflow-user-example-com

# Stream logs in real time
kubectl logs -f tensorflow-job-chief-0 -n kubeflow-user-example-com

# View logs from all of the job's Pods
kubectl logs -l training.kubeflow.org/job-name=tensorflow-job -n kubeflow-user-example-com
Stopping and Deleting Jobs #
bash
# Stop a job (deleting the resource stops it)
kubectl delete tfjob tensorflow-job -n kubeflow-user-example-com

# Stop a PyTorchJob
kubectl delete pytorchjob pytorch-job -n kubeflow-user-example-com

# Force delete
kubectl delete tfjob tensorflow-job -n kubeflow-user-example-com --force --grace-period=0
GPU Training #
Requesting GPU Resources #
yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: gpu-training-job
  namespace: kubeflow-user-example-com
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.12.0-gpu
              resources:
                limits:
                  nvidia.com/gpu: 2  # 2 GPUs per Worker
Multi-Node Multi-GPU #
yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: multi-node-gpu-job
  namespace: kubeflow-user-example-com
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
              resources:
                limits:
                  nvidia.com/gpu: 4
    Worker:
      replicas: 4
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
              resources:
                limits:
                  nvidia.com/gpu: 4
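Before submitting a job like this, it is worth sanity-checking the aggregate demand: 1 Master plus 4 Workers at 4 GPUs each need 20 schedulable GPUs. A tiny sketch of that arithmetic (the dict layout is a simplification for illustration, not the real API objects):

```python
def total_gpus(replica_specs):
    """Sum GPU limits across all replicas of a job spec fragment."""
    return sum(spec["replicas"] * spec["gpus_per_replica"]
               for spec in replica_specs.values())

# Simplified view of the multi-node-gpu-job above:
multi_node_job = {
    "Master": {"replicas": 1, "gpus_per_replica": 4},
    "Worker": {"replicas": 4, "gpus_per_replica": 4},
}
```

If the cluster cannot satisfy the total, some replicas stay Pending and the job stalls, so checking capacity first avoids partially scheduled jobs.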
Best Practices #
Training Configuration #
text
1. Resource planning
├── Choose a sensible number of Workers
├── Size memory to the model
└── Allocate GPUs appropriately
2. Data management
├── Use shared storage
├── Finish data preprocessing ahead of time
└── Optimize data loading
3. Checkpoint management
├── Save checkpoints regularly
├── Write them to persistent storage
└── Support resuming from checkpoints
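The checkpoint practices above can be sketched framework-agnostically: write the checkpoint to the shared volume (for example the /output PVC mount configured earlier) with an atomic rename, and resume from it after a restart. A minimal sketch, not tied to any framework:

```python
import json
import os

def save_checkpoint(path, step, state):
    """Persist step and state; atomic rename avoids torn checkpoints
    if the pod is killed mid-write."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    """Resume from a checkpoint if one exists, else start fresh."""
    if not os.path.exists(path):
        return 0, {}
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]
```

Combined with a restart policy, this gives basic fault tolerance: a restarted Worker reloads the last checkpoint instead of repeating completed steps.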
Failure Handling #
text
1. Automatic recovery
├── Configure a restart policy
├── Restore from checkpoints
└── Record error logs
2. Monitoring and alerting
├── Monitor training progress
├── Monitor resource usage
└── Alert on anomalies
3. Debugging tips
├── Inspect logs
├── Exec into containers to debug
└── Check events
Next Steps #
Now that you understand the basics of training jobs, continue with TensorFlow training to learn TFJob configuration and usage in depth.
Last updated: 2026-04-05