Production Deployment #
Overview #
This chapter covers best practices for deploying Kubeflow in production, helping you build a stable, reliable, high-performance machine learning platform.
Production Requirements #
```text
┌─────────────────────────────────────────────┐
│  Production Requirements                    │
├─────────────────────────────────────────────┤
│                                             │
│  High availability:                         │
│    ├── Multi-replica deployments            │
│    ├── Automatic failure recovery           │
│    ├── Load balancing                       │
│    └── Data backup                          │
│                                             │
│  Scalability:                               │
│    ├── Horizontal scaling                   │
│    ├── Autoscaling                          │
│    ├── Elastic resources                    │
│    └── Multi-cluster support                │
│                                             │
│  Security:                                  │
│    ├── Authentication and authorization     │
│    ├── Network isolation                    │
│    ├── Data encryption                      │
│    └── Audit logging                        │
│                                             │
│  Observability:                             │
│    ├── Monitoring and alerting              │
│    ├── Log collection                       │
│    ├── Distributed tracing                  │
│    └── Performance profiling                │
│                                             │
└─────────────────────────────────────────────┘
```
High-Availability Configuration #
Highly Available Kubeflow Core Components #
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-pipeline
  namespace: kubeflow
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-pipeline
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: ml-pipeline        # must match spec.selector
    spec:
      affinity:
        podAntiAffinity:
          # Prefer spreading replicas across nodes
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: ml-pipeline
                topologyKey: kubernetes.io/hostname
      containers:
        - name: ml-pipeline
          image: gcr.io/ml-pipeline/api-server:latest   # pin a specific tag in production
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8888
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8888
            initialDelaySeconds: 5
            periodSeconds: 5
```
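Replicas and anti-affinity alone do not protect against voluntary disruptions such as node drains during cluster upgrades. A PodDisruptionBudget keeps a minimum number of ml-pipeline pods running through such events; a minimal sketch:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ml-pipeline-pdb
  namespace: kubeflow
spec:
  minAvailable: 2        # with 3 replicas, at most one pod may be evicted at a time
  selector:
    matchLabels:
      app: ml-pipeline
```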
Database High Availability #
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
  namespace: kubeflow
spec:
  serviceName: mysql-headless
  replicas: 3
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql              # must match spec.selector
    spec:
      affinity:
        podAntiAffinity:
          # Never co-locate two MySQL pods on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: mysql
              topologyKey: kubernetes.io/hostname
      containers:
        - name: mysql
          image: mysql:8.0
          env:
            - name: MYSQL_ROOT_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: mysql-secret
                  key: password
          volumeMounts:
            - name: data
              mountPath: /var/lib/mysql
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
```

Note that three replicas of the stock `mysql:8.0` image are three independent servers, not a replicated cluster; for true database high availability, layer a replication mechanism (for example, a MySQL operator or group replication) on top of this topology.
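The StatefulSet references a `mysql-headless` Service via `serviceName`, which gives each replica a stable DNS name (e.g. `mysql-0.mysql-headless.kubeflow.svc`). That Service is not shown above; a minimal definition would look like:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mysql-headless
  namespace: kubeflow
spec:
  clusterIP: None        # headless: per-pod DNS records instead of a virtual IP
  selector:
    app: mysql
  ports:
    - name: mysql
      port: 3306
```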
MinIO High Availability #
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: minio
  namespace: kubeflow
spec:
  serviceName: minio-headless
  replicas: 4
  selector:
    matchLabels:
      app: minio
  template:
    metadata:
      labels:
        app: minio              # must match spec.selector
    spec:
      containers:
        - name: minio
          image: minio/minio:latest   # pin a specific release in production
          args:
            - server
            - http://minio-{0...3}.minio-headless.kubeflow.svc.cluster.local/data
            - --console-address
            - ":9001"
          env:
            - name: MINIO_ROOT_USER
              value: "minio"
            - name: MINIO_ROOT_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: minio-secret
                  key: password
          ports:
            - containerPort: 9000
            - containerPort: 9001
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 500Gi
```
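MinIO's distributed mode relies on the per-pod DNS names (`minio-{0...3}.minio-headless...`) provided by the headless Service named in `serviceName`; a minimal sketch of that Service:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: minio-headless
  namespace: kubeflow
spec:
  clusterIP: None
  selector:
    app: minio
  ports:
    - name: api
      port: 9000
    - name: console
      port: 9001
```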
Performance Tuning #
Resource Configuration #
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: resource-limits
  namespace: kubeflow
spec:
  limits:
    - type: Container
      default:
        cpu: "2"
        memory: "4Gi"
      defaultRequest:
        cpu: "100m"
        memory: "256Mi"
      max:
        cpu: "32"
        memory: "128Gi"
      min:
        cpu: "50m"
        memory: "64Mi"
    - type: PersistentVolumeClaim
      max:
        storage: "1Ti"
      min:
        storage: "1Gi"
```
Node Optimization #
```yaml
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
  labels:
    accelerator: nvidia-tesla-v100
    node-type: gpu-worker
  annotations:
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
spec:
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule
```
GPU Scheduling Optimization #
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-training
value: 1000000
globalDefault: false
description: "High-priority training jobs"
---
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: optimized-training
  namespace: kubeflow-user-example-com
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 4
      template:
        spec:
          priorityClassName: high-priority-training
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:latest-gpu
              resources:
                limits:
                  nvidia.com/gpu: "4"
```
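Because the GPU nodes above carry a `nvidia.com/gpu` NoSchedule taint, training pods must tolerate it unless the cluster runs the ExtendedResourceToleration admission controller, which adds the toleration automatically for pods that request the extended resource. A sketch of the stanza to add under the worker pod spec:

```yaml
# Goes under tfReplicaSpecs.Worker.template.spec
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```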
Operations Management #
Automated Operations #
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: kubeflow-backup
  namespace: kubeflow
spec:
  schedule: "0 2 * * *"        # every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: bitnami/kubectl:latest
              env:
                # ${MYSQL_ROOT_PASSWORD} expands in this container's shell,
                # so the password must be available here
                - name: MYSQL_ROOT_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: mysql-secret
                      key: password
              command:
                - /bin/sh
                - -c
                - |
                  kubectl exec -n kubeflow mysql-0 -- mysqldump -u root -p${MYSQL_ROOT_PASSWORD} --all-databases > /backup/mysql-backup.sql
                  # note: mc mirror runs inside minio-0 and needs a /backup mount there
                  kubectl exec -n kubeflow minio-0 -- mc mirror /data /backup/minio-backup
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            # Assumes a PVC named backup-pvc exists to hold the dump output
            - name: backup
              persistentVolumeClaim:
                claimName: backup-pvc
          restartPolicy: OnFailure
```
Health Checks #
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: health-check
  namespace: kubeflow
spec:
  schedule: "*/5 * * * *"      # every 5 minutes
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: health-check
              image: curlimages/curl:latest
              command:
                - /bin/sh
                - -c
                - |
                  curl -f http://ml-pipeline.kubeflow.svc:8888/healthz || exit 1
                  curl -f http://centraldashboard.kubeflow.svc:80/healthz || exit 1
          restartPolicy: OnFailure
```
Log Rotation #
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: logrotate-config
  namespace: kubeflow
data:
  logrotate.conf: |
    /var/log/kubeflow/*.log {
        daily
        rotate 7
        compress
        delaycompress
        missingok
        notifempty
        create 0644 root root
    }
```
Failure Recovery #
Backup Strategy #
```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: kubeflow-daily-backup
  namespace: velero
spec:
  includedNamespaces:
    - kubeflow
    - istio-system
    - cert-manager
  excludedResources:
    - events
    - pods
  ttl: 720h                    # keep backups for 30 days
  storageLocation: default
  volumeSnapshotLocations:
    - default
  hooks:
    resources:
      - name: pre-backup-hook
        includedNamespaces:
          - kubeflow
        labelSelector:
          matchLabels:
            app: mysql
        pre:
          - exec:
              container: mysql
              command:
                - /bin/sh
                - -c
                # Note: the read lock lasts only as long as this shell session,
                # so this hook is illustrative rather than a durable lock
                - "mysql -u root -p${MYSQL_ROOT_PASSWORD} -e 'FLUSH TABLES WITH READ LOCK;'"
              onError: Continue
```
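A Velero `Backup` object runs exactly once; to actually take this backup daily as the name suggests, wrap the same spec in a `Schedule`. A minimal sketch:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: kubeflow-daily
  namespace: velero
spec:
  schedule: "0 3 * * *"        # daily at 03:00
  template:                    # same fields as a Backup spec
    includedNamespaces:
      - kubeflow
      - istio-system
      - cert-manager
    ttl: 720h
```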
Restore Procedure #
```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: kubeflow-restore
  namespace: velero
spec:
  backupName: kubeflow-daily-backup
  includedNamespaces:
    - kubeflow
    - istio-system
  restorePVs: true
  hooks:
    resources:
      - name: post-restore-hook
        includedNamespaces:
          - kubeflow
        labelSelector:
          matchLabels:
            app: mysql
        postHooks:             # restore hooks use postHooks, not post
          - exec:
              container: mysql
              command:
                - /bin/sh
                - -c
                - "mysql -u root -p${MYSQL_ROOT_PASSWORD} -e 'UNLOCK TABLES;'"
              onError: Continue
```
Disaster Recovery #
```text
Disaster recovery procedure:
1. Preparation
   ├── Confirm backups are usable
   ├── Prepare the recovery environment
   └── Notify stakeholders
2. Recovery
   ├── Restore the Kubernetes cluster
   ├── Restore Kubeflow components
   ├── Restore data stores
   └── Verify service status
3. Validation
   ├── Functional tests
   ├── Performance tests
   └── Security checks
4. Cutover
   ├── Update DNS
   ├── Verify traffic
   └── Monitor alerts
```
Security Hardening #
Security Baseline #
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: secure-pod-template
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL
```
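With `readOnlyRootFilesystem: true`, most applications still need somewhere writable for temporary files and caches. The usual pattern is to mount an `emptyDir` at those paths; a sketch assuming the app writes to `/tmp`:

```yaml
# In the container spec above:
volumeMounts:
  - name: tmp
    mountPath: /tmp
---
# In the pod spec:
volumes:
  - name: tmp
    emptyDir: {}
```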
Network Security #
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: kubeflow-network-policy
  namespace: kubeflow
spec:
  podSelector: {}              # applies to all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: istio-system
  egress:
    # Allow DNS to any namespace
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP        # DNS falls back to TCP for large responses
          port: 53
    # Allow in-namespace traffic
    - to:
        - namespaceSelector:
            matchLabels:
              name: kubeflow
```
Monitoring and Alerting #
Key Metrics #
```text
Infrastructure metrics:
├── Node CPU/memory utilization
├── Node disk usage
├── Network traffic
└── GPU utilization
Application metrics:
├── Pipeline run success rate
├── Training job success rate
├── Model serving latency
└── Error rate
Business metrics:
├── User activity
├── Resource-usage efficiency
├── Task queue length
└── Cost metrics
```
Alerting Rules #
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubeflow-production-alerts
  namespace: monitoring
spec:
  groups:
    - name: kubeflow.production
      rules:
        - alert: KubeflowComponentDown
          expr: up{namespace="kubeflow"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Kubeflow component is down"
            description: "{{ $labels.job }} has been down for more than 5 minutes"
        - alert: HighErrorRate
          expr: rate(http_requests_total{status=~"5..", namespace="kubeflow"}[5m]) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value }} per second"
        - alert: GPUClusterUtilizationLow
          expr: avg(DCGM_FI_DEV_GPU_UTIL) < 20
          for: 1h
          labels:
            severity: info
          annotations:
            summary: "GPU cluster utilization is low"
            description: "Average GPU utilization is {{ $value }}%"
```
Cost Optimization #
Resource Quotas #
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: kubeflow-user-example-com
spec:
  hard:
    requests.cpu: "50"
    requests.memory: "100Gi"
    limits.cpu: "100"
    limits.memory: "200Gi"
    requests.nvidia.com/gpu: "10"
```
Autoscaling #
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-pipeline-hpa
  namespace: kubeflow
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-pipeline
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```
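By default the HPA can remove replicas aggressively once load drops, which causes flapping under bursty pipeline traffic. `autoscaling/v2` lets you slow scale-down with a `behavior` stanza; a sketch of what could be added under `spec`:

```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # wait 5 min of low load before removing pods
    policies:
      - type: Pods
        value: 1
        periodSeconds: 60             # remove at most one pod per minute
  scaleUp:
    stabilizationWindowSeconds: 0     # scale up immediately
```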
Best-Practice Checklists #
Pre-Deployment Checklist #
```text
□ Cluster has sufficient resources
□ Storage is configured correctly
□ Network policies are in place
□ Security hardening is complete
□ Monitoring stack is deployed
□ Backup strategy is configured
□ Alerting rules are configured
□ Documentation is complete
```
Operations Checklist #
```text
□ Regularly review resource usage
□ Regularly review logs
□ Regularly review alerts
□ Run backups regularly
□ Perform regular security audits
□ Perform regular performance tests
□ Run regular failure drills
□ Keep documentation up to date
```
Summary #
In this chapter you learned the key aspects of running Kubeflow in production:
- High-availability configuration - keeping the platform running reliably
- Performance tuning - improving resource efficiency
- Operations management - simplifying day-to-day operations
- Failure recovery - restoring service quickly
- Security hardening - protecting the platform
- Monitoring and alerting - detecting problems early
- Cost optimization - keeping operating costs under control
You have now finished this complete guide to Kubeflow and are ready to apply these skills in real projects!
Last updated: 2026-04-05