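Production Deployment #

Overview #

This chapter covers best practices for deploying Kubeflow in production, to help you build a stable, reliable, high-performance machine learning platform.

Production Environment Requirements #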

text
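┌─────────────────────────────────────────────────────────────┐
│             Production Environment Requirements             │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  High availability:                                         │
│  ├── Multi-replica deployment                               │
│  ├── Automatic failure recovery                             │
│  ├── Load balancing                                         │
│  └── Data backup                                            │
│                                                             │
│  Scalability:                                               │
│  ├── Horizontal scaling                                     │
│  ├── Autoscaling                                            │
│  ├── Elastic resources                                      │
│  └── Multi-cluster support                                  │
│                                                             │
│  Security:                                                  │
│  ├── Authentication and authorization                       │
│  ├── Network isolation                                      │
│  ├── Data encryption                                        │
│  └── Audit logging                                          │
│                                                             │
│  Observability:                                             │
│  ├── Monitoring and alerting                                │
│  ├── Log collection                                         │
│  ├── Distributed tracing                                    │
│  └── Performance profiling                                  │
│                                                             │
└─────────────────────────────────────────────────────────────┘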

High Availability Configuration #

High Availability for Kubeflow Core Components #

yaml
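apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-pipeline
  namespace: kubeflow
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-pipeline
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0        # keep all 3 replicas serving during a rollout
  template:
    metadata:
      labels:
        app: ml-pipeline       # required: must match spec.selector.matchLabels
    spec:
      affinity:
        podAntiAffinity:
          # Prefer placing replicas on different nodes.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: ml-pipeline
              topologyKey: kubernetes.io/hostname
      containers:
      - name: ml-pipeline
        # Pin a specific release tag in production instead of "latest".
        image: gcr.io/ml-pipeline/api-server:latest
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "2"
            memory: "4Gi"
        livenessProbe:
          httpGet:
            # Upstream Kubeflow Pipelines manifests probe this endpoint;
            # confirm the path for the KFP version you deploy.
            path: /apis/v1beta1/healthz
            port: 8888
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /apis/v1beta1/healthz
            port: 8888
          initialDelaySeconds: 5
          periodSeconds: 5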
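
Running several replicas protects against node failure, but voluntary disruptions such as node drains can still evict more than one replica at a time. A PodDisruptionBudget is one way to limit that. The sketch below assumes the pods carry the `app: ml-pipeline` label from the Deployment's selector; the name `ml-pipeline-pdb` and the `minAvailable: 2` threshold are illustrative choices, not values taken from the Kubeflow manifests.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ml-pipeline-pdb        # hypothetical name
  namespace: kubeflow
spec:
  # Assumption: with 3 replicas, allow at most one to be evicted at a time.
  minAvailable: 2
  selector:
    matchLabels:
      app: ml-pipeline
```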

Database High Availability #

yaml
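# Note: replicas: 3 starts three independent MySQL instances, each with its own volume.
# Replication between them (for example MySQL Group Replication, or a MySQL operator)
# has to be configured separately before this setup is actually highly available.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
  namespace: kubeflow
spec:
  serviceName: mysql-headless
  replicas: 3
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql             # required: must match spec.selector.matchLabels
    spec:
      affinity:
        podAntiAffinity:
          # Hard requirement: never schedule two MySQL pods on the same node.
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: mysql
            topologyKey: kubernetes.io/hostname
      containers:
      - name: mysql
        image: mysql:8.0
        env:
        - name: MYSQL_ROOT_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mysql-secret
              key: password
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
  # One PersistentVolumeClaim is created per pod from this template.
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi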
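
The StatefulSet refers to `serviceName: mysql-headless`, so a headless Service with that name has to exist for the pods to get stable DNS names such as `mysql-0.mysql-headless`. A minimal sketch follows; port 3306 is the MySQL default and is an assumption here, since the manifest above does not declare container ports.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mysql-headless
  namespace: kubeflow
spec:
  clusterIP: None        # headless: DNS resolves to the individual pod IPs
  selector:
    app: mysql
  ports:
  - name: mysql
    port: 3306           # assumption: default MySQL port
```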

MinIO High Availability #

yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: minio
  namespace: kubeflow
spec:
  serviceName: minio-headless
  replicas: 4                  # must match the {0...3} range in the server arguments
  selector:
    matchLabels:
      app: minio
  template:
    metadata:
      labels:
        app: minio             # required: must match spec.selector.matchLabels
    spec:
      containers:
      - name: minio
        image: minio/minio:latest   # pin a specific release tag in production
        args:
        - server
        # Distributed mode: MinIO expands {0...3} to the four pod hostnames.
        - http://minio-{0...3}.minio-headless.kubeflow.svc.cluster.local/data
        - --console-address
        - ":9001"
        env:
        - name: MINIO_ROOT_USER
          value: "minio"
        - name: MINIO_ROOT_PASSWORD
          valueFrom:
            secretKeyRef:
              name: minio-secret
              key: password
        ports:
        - containerPort: 9000   # S3 API
        - containerPort: 9001   # web console
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 500Gi
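
The server arguments address peers as `minio-{0...3}.minio-headless.kubeflow.svc.cluster.local`, which only resolves if a headless Service named `minio-headless` exists. The sketch below is one possible definition. Setting `publishNotReadyAddresses: true` is an assumption: it lets the peers find each other through DNS before any of them reports ready, which a distributed start-up may need.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: minio-headless
  namespace: kubeflow
spec:
  clusterIP: None
  publishNotReadyAddresses: true   # assumption: peers must resolve before they are ready
  selector:
    app: minio
  ports:
  - name: api
    port: 9000
  - name: console
    port: 9001
```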

Performance Optimization #

Resource Configuration Tuning #

yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: resource-limits
  namespace: kubeflow
spec:
  limits:
  - type: Container
    default:
      cpu: "2"
      memory: "4Gi"
    defaultRequest:
      cpu: "100m"
      memory: "256Mi"
    max:
      cpu: "32"
      memory: "128Gi"
    min:
      cpu: "50m"
      memory: "64Mi"
  - type: PersistentVolumeClaim
    max:
      storage: "1Ti"
    min:
      storage: "1Gi"

Node Optimization #

yaml
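# Shown as a Node object for reference. In practice the same labels and taint
# are usually applied to an existing node with:
#   kubectl label node gpu-node-1 accelerator=nvidia-tesla-v100 node-type=gpu-worker
#   kubectl taint node gpu-node-1 nvidia.com/gpu=true:NoSchedule
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
  labels:
    accelerator: nvidia-tesla-v100
    node-type: gpu-worker
  annotations:
    # These annotations are normally managed by Kubernetes itself.
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
spec:
  taints:
  # Keeps non-GPU workloads off this node; GPU pods need a matching toleration.
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule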
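
A workload only lands on such a node if it both selects the node and tolerates the taint. The Pod below is a minimal sketch of that pairing. The Pod name is hypothetical, the image is reused from this chapter, and running `nvidia-smi` assumes the NVIDIA device plugin and container runtime are installed on the node.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                 # hypothetical name
  namespace: kubeflow-user-example-com
spec:
  restartPolicy: Never
  nodeSelector:
    node-type: gpu-worker              # matches the node label above
  tolerations:
  - key: nvidia.com/gpu                # matches the node taint above
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: gpu-check
    image: tensorflow/tensorflow:latest-gpu   # pin a specific tag in practice
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```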

GPU Scheduling Optimization #

yaml
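apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-training
value: 1000000
globalDefault: false
description: "High-priority training jobs"
---
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: optimized-training
  namespace: kubeflow-user-example-com
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 4
      template:
        spec:
          priorityClassName: high-priority-training
          # Needed when GPU nodes carry the nvidia.com/gpu=true:NoSchedule taint.
          tolerations:
          - key: nvidia.com/gpu
            operator: Equal
            value: "true"
            effect: NoSchedule
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:latest-gpu   # pin a specific tag in production
            resources:
              limits:
                # 4 workers x 4 GPUs = 16 GPUs in total; this must fit within
                # any GPU ResourceQuota set on the namespace.
                nvidia.com/gpu: "4"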

Operations Management #

Automated Operations #

yaml
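apiVersion: batch/v1
kind: CronJob
metadata:
  name: kubeflow-backup
  namespace: kubeflow
spec:
  schedule: "0 2 * * *"   # every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          # Assumption: a ServiceAccount permitted to run "kubectl exec" in this namespace.
          serviceAccountName: kubeflow-backup
          containers:
          - name: backup
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              set -e
              # Single quotes defer expansion to the mysql container, where MYSQL_ROOT_PASSWORD is set.
              kubectl exec -n kubeflow mysql-0 -- sh -c 'mysqldump -u root -p"$MYSQL_ROOT_PASSWORD" --all-databases' > /backup/mysql-backup.sql
              # This command runs inside minio-0, so /backup/minio-backup must be a path mounted in that pod.
              kubectl exec -n kubeflow minio-0 -- mc mirror /data /backup/minio-backup
            volumeMounts:
            - name: backup
              mountPath: /backup
          restartPolicy: OnFailure
          volumes:
          - name: backup
            persistentVolumeClaim:
              claimName: kubeflow-backup   # assumption: a pre-created PVC for backup files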
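
A job that calls `kubectl exec` needs RBAC permission to do so, which the namespace's default ServiceAccount normally lacks. The following is a minimal sketch of a dedicated identity for the backup job. The names `kubeflow-backup` and `kubeflow-backup-exec` are hypothetical, and the CronJob's pod spec would have to reference the account through `serviceAccountName`.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kubeflow-backup            # hypothetical name
  namespace: kubeflow
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kubeflow-backup-exec       # hypothetical name
  namespace: kubeflow
rules:
# kubectl exec first reads the pod, then creates an exec session on it.
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get"]
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kubeflow-backup-exec
  namespace: kubeflow
subjects:
- kind: ServiceAccount
  name: kubeflow-backup
  namespace: kubeflow
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kubeflow-backup-exec
```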

Health Checks #

yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: health-check
  namespace: kubeflow
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: health-check
            image: curlimages/curl:latest
            command:
            - /bin/sh
            - -c
            - |
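              # Upstream KFP manifests probe /apis/v1beta1/healthz; confirm the path for your KFP version.
              curl -f http://ml-pipeline.kubeflow.svc:8888/apis/v1beta1/healthz || exit 1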
              curl -f http://centraldashboard.kubeflow.svc:80/healthz || exit 1
          restartPolicy: OnFailure

Log Rotation #

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: logrotate-config
  namespace: kubeflow
data:
  logrotate.conf: |
    /var/log/kubeflow/*.log {
        daily
        rotate 7
        compress
        delaycompress
        missingok
        notifempty
        create 0644 root root
    }

Failure Recovery #

Backup Strategy #

yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: kubeflow-daily-backup
  namespace: velero
spec:
  includedNamespaces:
  - kubeflow
  - istio-system
  - cert-manager
  # Pods are not excluded: Velero runs backup hooks against the pods it backs up.
  excludedResources:
  - events
  ttl: 720h   # keep the backup for 30 days
  storageLocation: default
  volumeSnapshotLocations:
  - default
  hooks:
    resources:
    - name: pre-backup-hook
      includedNamespaces:
      - kubeflow
      labelSelector:
        matchLabels:
          app: mysql
      pre:
      - exec:
          container: mysql
          command:
          - /bin/sh
          - -c
          # Caveat: the read lock is released as soon as this mysql session exits,
          # so the hook flushes tables but does not hold the lock during the snapshot.
          - "mysql -u root -p${MYSQL_ROOT_PASSWORD} -e 'FLUSH TABLES WITH READ LOCK;'"
          onError: Continue
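
A `Backup` object describes a single run. For a recurring backup, Velero provides the `Schedule` resource, which wraps a backup spec in a cron expression. The sketch below reuses the namespaces and retention from the backup above; the name `kubeflow-daily` and the 03:00 run time are illustrative assumptions.

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: kubeflow-daily             # hypothetical name
  namespace: velero
spec:
  schedule: "0 3 * * *"            # assumption: run daily at 03:00
  template:                        # same fields as a Backup spec
    includedNamespaces:
    - kubeflow
    - istio-system
    - cert-manager
    excludedResources:
    - events
    ttl: 720h0m0s                  # 30 days
    storageLocation: default
```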

Restore Procedure #

yaml
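apiVersion: velero.io/v1
kind: Restore
metadata:
  name: kubeflow-restore
  namespace: velero
spec:
  backupName: kubeflow-daily-backup
  includedNamespaces:
  - kubeflow
  - istio-system
  restorePVs: true
  hooks:
    resources:
    - name: post-restore-hook
      includedNamespaces:
      - kubeflow
      labelSelector:
        matchLabels:
          app: mysql
      # Restore hooks are declared under postHooks (Backup hooks use pre/post).
      postHooks:
      - exec:
          container: mysql
          command:
          - /bin/sh
          - -c
          # UNLOCK TABLES only affects the session that runs it; a freshly restored
          # server holds no lock, so treat this as a placeholder for post-restore checks.
          - "mysql -u root -p${MYSQL_ROOT_PASSWORD} -e 'UNLOCK TABLES;'"
          onError: Continue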

Disaster Recovery #

text
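Disaster recovery steps:

1. Preparation
   ├── Confirm that backups are usable
   ├── Prepare the recovery environment
   └── Notify the relevant people

2. Recovery
   ├── Restore the Kubernetes cluster
   ├── Restore Kubeflow components
   ├── Restore data stores
   └── Verify service status

3. Validation
   ├── Functional testing
   ├── Performance testing
   └── Security checks

4. Cutover
   ├── Update DNS
   ├── Verify traffic
   └── Monitor alerts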

Security Hardening #

Security Baseline #

yaml
apiVersion: v1
kind: Pod
metadata:
  name: secure-pod-template
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    # image: <your application image>   (required; omitted in this template)
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true   # mount an emptyDir for any path the app must write to
      capabilities:
        drop:
        - ALL
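
Per-pod settings like these can be checked namespace-wide with Pod Security Admission labels. The sketch below only warns and audits against the `restricted` profile rather than enforcing it, on the assumption that some injected containers (for example Istio's init container when Istio CNI is not used) would fail an enforced profile. Applying it to `kubeflow-user-example-com` is also an assumption, and a Profile controller may manage that namespace's labels itself.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: kubeflow-user-example-com
  labels:
    # Report violations of the "restricted" profile without blocking pods.
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```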

Network Security #

yaml
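apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: kubeflow-network-policy
  namespace: kubeflow
spec:
  podSelector: {}            # applies to every pod in the namespace
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    # kubernetes.io/metadata.name is set automatically on every namespace (Kubernetes 1.21+).
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: istio-system
    - podSelector: {}        # pods in the kubeflow namespace itself
  egress:
  - to:
    - namespaceSelector: {}
    ports:                   # DNS over UDP and TCP
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kubeflow
    - namespaceSelector:     # sidecars need to reach istiod
        matchLabels:
          kubernetes.io/metadata.name: istio-system
  # Controllers also need egress to the Kubernetes API server;
  # add a rule for its address in your cluster.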

Monitoring and Alerting #

Key Metrics #

text
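Infrastructure metrics:
├── Node CPU/memory utilization
├── Node disk utilization
├── Network traffic
└── GPU utilization

Application metrics:
├── Pipeline run success rate
├── Training job success rate
├── Model serving latency
└── Error rate

Business metrics:
├── User activity
├── Resource usage efficiency
├── Task queue length
└── Cost metrics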

Alerting Rules #

yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubeflow-production-alerts
  namespace: monitoring
spec:
  groups:
  - name: kubeflow.production
    rules:
    - alert: KubeflowComponentDown
      expr: up{namespace="kubeflow"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Kubeflow component is down"
        description: "{{ $labels.job }} has been down for more than 5 minutes"
    
    - alert: HighErrorRate
      expr: rate(http_requests_total{status=~"5..", namespace="kubeflow"}[5m]) > 0.1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High error rate detected"
        description: "Error rate is {{ $value }} per second"
    
    - alert: GPUClusterUtilizationLow
      expr: avg(DCGM_FI_DEV_GPU_UTIL) < 20
      for: 1h
      labels:
        severity: info
      annotations:
        summary: "GPU cluster utilization is low"
        description: "Average GPU utilization is {{ $value }}%"
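
An absolute threshold on `rate(http_requests_total{status=~"5.."})` fires per time series and depends on traffic volume. A ratio of failed to total requests is often easier to tune. The rule below is a sketch of that variant. It assumes the same `http_requests_total` metric and `status` label as the rule above, and the rule name and the 5% threshold are illustrative.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubeflow-error-ratio       # hypothetical name
  namespace: monitoring
spec:
  groups:
  - name: kubeflow.error-ratio
    rules:
    - alert: HighErrorRatio
      # Share of requests per job that returned a 5xx status over the last 5 minutes.
      expr: |
        sum by (job) (rate(http_requests_total{namespace="kubeflow", status=~"5.."}[5m]))
          /
        sum by (job) (rate(http_requests_total{namespace="kubeflow"}[5m]))
          > 0.05
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High 5xx error ratio"
        description: "{{ $labels.job }} is returning 5xx for {{ $value | humanizePercentage }} of requests"
```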

Cost Optimization #

Resource Optimization #

yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: kubeflow-user-example-com
spec:
  hard:
    requests.cpu: "50"
    requests.memory: "100Gi"
    limits.cpu: "100"
    limits.memory: "200Gi"
    requests.nvidia.com/gpu: "10"

Autoscaling #

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-pipeline-hpa
  namespace: kubeflow
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-pipeline
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
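
Utilization targets alone can make replica counts oscillate when load is bursty. The `autoscaling/v2` API has a `behavior` section to dampen that. The fragment below is a sketch to merge under the HPA's `spec`; the window and policy values are illustrative assumptions, not tuned recommendations.

```yaml
# Fragment: merge under spec of the HorizontalPodAutoscaler.
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # act on the highest recommendation of the last 5 minutes
    policies:
    - type: Pods
      value: 1                        # remove at most 1 pod per minute
      periodSeconds: 60
  scaleUp:
    stabilizationWindowSeconds: 0     # react to load increases immediately
    policies:
    - type: Percent
      value: 100                      # at most double the replica count per minute
      periodSeconds: 60
```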

Best Practices Checklist #

Pre-deployment Checklist #

text
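□ Cluster resources are sufficient
□ Storage is configured correctly
□ Network policies are configured
□ Security configuration is complete
□ Monitoring system is deployed
□ Backup strategy is configured
□ Alerting rules are configured
□ Documentation is complete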

Operations Checklist #

text
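□ Review resource usage regularly
□ Review logs regularly
□ Review alerts regularly
□ Run backups regularly
□ Run security audits regularly
□ Run performance tests regularly
□ Run failure drills regularly
□ Update documentation regularly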

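Summary #

This chapter covered the key areas of deploying Kubeflow in production:

  1. High availability configuration - keep the platform running reliably
  2. Performance optimization - improve resource utilization
  3. Operations management - simplify day-to-day operations
  4. Failure recovery - restore service quickly
  5. Security hardening - protect the platform
  6. Monitoring and alerting - detect problems early
  7. Cost optimization - keep operating costs under control

This concludes the Complete Guide to Kubeflow. You can now start applying what you have learned in real projects.

Last updated: 2026-04-05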