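Advanced Configuration #

Overview #

This chapter covers Kubeflow's advanced configuration options for managing and tuning a Kubeflow platform in an enterprise environment.

Advanced Configuration Topics #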

text
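┌─────────────────────────────────────────────────────────────┐
│                Advanced Configuration Topics                │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Multi-tenancy:                                             │
│  ├── Profile management                                     │
│  ├── Namespace isolation                                    │
│  ├── Resource quotas                                        │
│  └── Access control                                         │
│                                                             │
│  Resource management:                                       │
│  ├── Resource quota configuration                           │
│  ├── Priority and preemption                                │
│  ├── Node scheduling                                        │
│  └── GPU scheduling                                         │
│                                                             │
│  Networking:                                                │
│  ├── Istio service mesh                                     │
│  ├── Network policies                                       │
│  ├── Ingress configuration                                  │
│  └── DNS configuration                                      │
│                                                             │
│  Custom components:                                         │
│  ├── Custom images                                          │
│  ├── Custom components                                      │
│  ├── Plugin development                                     │
│  └── Integrations and extensions                            │
│                                                             │
└─────────────────────────────────────────────────────────────┘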

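Multi-tenancy Management #

Profile Overview #

A Profile is the core concept of Kubeflow multi-tenancy: each user or team maps to one Profile.

text
What a Profile does:
├── Creates the user namespace
├── Configures resource quotas
├── Sets permissions
├── Configures plugins
└── Manages the lifecycle

Creating a Profile #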

yaml
apiVersion: kubeflow.org/v1beta1
kind: Profile
metadata:
  name: kubeflow-user-alice
spec:
  owner:
    kind: User
    name: alice@example.com
  plugins:
  - kind: WorkloadIdentity
    spec:
      gcpServiceAccount: alice-sa@project.iam.gserviceaccount.com
  resourceQuotaSpec:
    hard:
      cpu: "100"
      memory: 200Gi
      persistentvolumeclaims: "20"
      # Extended resources such as GPUs can only be limited through the "requests." prefix.
      requests.nvidia.com/gpu: "10"
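
When many users need a Profile, generating the manifest from a small helper can be less error-prone than copying YAML by hand. The following is a minimal sketch, not a Kubeflow API: `build_profile` and the `kubeflow-user-bob` values are hypothetical names chosen for illustration, and the plugin section is left out.

```python
import json


def build_profile(name, owner_email, quota):
    """Return a Profile manifest as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "kubeflow.org/v1beta1",
        "kind": "Profile",
        "metadata": {"name": name},
        "spec": {
            "owner": {"kind": "User", "name": owner_email},
            "resourceQuotaSpec": {"hard": dict(quota)},
        },
    }


profile = build_profile(
    "kubeflow-user-bob",  # illustrative Profile name
    "bob@example.com",    # illustrative owner
    {"cpu": "100", "memory": "200Gi", "requests.nvidia.com/gpu": "10"},
)

# kubectl accepts JSON manifests, so the output can be saved and applied as-is.
print(json.dumps(profile, indent=2))
```

Saving the output to a file and running `kubectl apply -f` on it should have the same effect as applying the equivalent YAML.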

Profile Configuration in Detail #

yaml
apiVersion: kubeflow.org/v1beta1
kind: Profile
metadata:
  name: team-data-science
spec:
  owner:
    kind: User
    name: team-lead@example.com
  
  plugins:
  - kind: WorkloadIdentity
    spec:
      gcpServiceAccount: team-sa@project.iam.gserviceaccount.com
  
  resourceQuotaSpec:
    hard:
      cpu: "200"
      memory: 400Gi
      requests.nvidia.com/gpu: "20"  # quotas on extended resources require the "requests." prefix
      persistentvolumeclaims: "50"
      pods: "100"
      services: "50"
      secrets: "100"
      configmaps: "100"
  
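  # Note: the upstream Profile v1beta1 spec defines only owner, plugins, and
  # resourceQuotaSpec. Verify that your installed CRD accepts this field before relying on it.
  defaultPodSecurityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000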

Managing User Permissions #

yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: user-edit
  namespace: kubeflow-user-alice
subjects:
- kind: User
  name: alice@example.com
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: kubeflow-edit
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: user-view
  namespace: kubeflow-user-alice
subjects:
- kind: User
  name: bob@example.com
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: kubeflow-view
  apiGroup: rbac.authorization.k8s.io

Resource Quota Management #

Namespace Quotas #

yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
  namespace: kubeflow-user-alice
spec:
  hard:
    requests.cpu: "50"
    requests.memory: 100Gi
    limits.cpu: "100"
    limits.memory: 200Gi
    requests.nvidia.com/gpu: "10"
    pods: "50"
    persistentvolumeclaims: "20"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-counts
  namespace: kubeflow-user-alice
spec:
  hard:
    configmaps: "50"
    secrets: "50"
    services: "20"
    replicationcontrollers: "10"
    count/jobs.batch: "20"
    count/cronjobs.batch: "10"

LimitRange Configuration #

yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: resource-limits
  namespace: kubeflow-user-alice
spec:
  limits:
  - type: Container
    default:
      cpu: "2"
      memory: "4Gi"
    defaultRequest:
      cpu: "100m"
      memory: "256Mi"
    max:
      cpu: "16"
      memory: "64Gi"
    min:
      cpu: "50m"
      memory: "64Mi"
  - type: PersistentVolumeClaim
    max:
      storage: "500Gi"
    min:
      storage: "1Gi"
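
The interaction between defaults and bounds is easy to misread, so here is a small model of how the container CPU values above would be resolved. It is a sketch under simplifying assumptions (CPU only, one container); `to_millicores` and `apply_limit_range` are hypothetical helpers, not Kubernetes functions.

```python
def to_millicores(cpu):
    """Convert a CPU quantity such as '2' or '100m' to integer millicores."""
    cpu = str(cpu)
    if cpu.endswith("m"):
        return int(cpu[:-1])
    return int(round(float(cpu) * 1000))


# CPU values taken from the Container entry of the LimitRange above.
LIMIT_RANGE = {"default": "2", "defaultRequest": "100m", "max": "16", "min": "50m"}


def apply_limit_range(cpu, limit_range=LIMIT_RANGE):
    """Fill in a missing CPU request or limit, then validate against min/max."""
    request, limit = cpu.get("request"), cpu.get("limit")
    if request is None:
        # A container that sets only a limit gets a request equal to that limit.
        request = limit if limit is not None else limit_range["defaultRequest"]
    if limit is None:
        limit = limit_range["default"]

    low, high = to_millicores(limit_range["min"]), to_millicores(limit_range["max"])
    if not (low <= to_millicores(request) <= to_millicores(limit) <= high):
        raise ValueError(f"cpu request={request} limit={limit} violates the LimitRange")
    return {"request": request, "limit": limit}


print(apply_limit_range({}))              # {'request': '100m', 'limit': '2'}
print(apply_limit_range({"limit": "4"}))  # {'request': '4', 'limit': '4'}
```

A container asking for a 32-CPU limit would raise `ValueError` here, mirroring the admission rejection caused by `max.cpu: "16"`.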

Priority and Preemption #

yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "High-priority training jobs"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
globalDefault: true
description: "Low-priority development jobs"
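
The effect of these two classes can be illustrated with a toy model of preemption. This is a deliberately simplified sketch: the real scheduler works per node and also considers PodDisruptionBudgets, affinity, and graceful termination. The `Pod` class, `pick_victims`, and the pod names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    priority: int  # value of the pod's PriorityClass
    gpus: int      # GPUs requested


def pick_victims(pending, running, free_gpus):
    """Pick lower-priority pods to evict so `pending` fits; None if impossible."""
    needed = pending.gpus - free_gpus
    if needed <= 0:
        return []  # fits without preemption

    # Only pods with strictly lower priority may be preempted, lowest first.
    candidates = sorted(
        (p for p in running if p.priority < pending.priority),
        key=lambda p: p.priority,
    )
    victims = []
    for pod in candidates:
        victims.append(pod)
        needed -= pod.gpus
        if needed <= 0:
            return victims
    return None  # even evicting every candidate would not free enough GPUs


running = [
    Pod("dev-notebook", 1000, 1),
    Pod("dev-job", 1000, 2),
    Pod("prod-train", 1000000, 4),
]
urgent = Pod("urgent-train", 1000000, 2)

victims = pick_victims(urgent, running, free_gpus=0)
print([p.name for p in victims])  # ['dev-notebook', 'dev-job']
```

Note that `prod-train` is never a candidate because it has the same priority as the pending pod.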

Network Configuration #

Istio Service Mesh #

yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: kubeflow-dashboard
  namespace: istio-system
spec:
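  gateways:
  # Qualified with its namespace, assuming the gateway lives in "kubeflow" as in the default manifests.
  - kubeflow/kubeflow-gateway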
  hosts:
  - "*"
  http:
  - match:
    - uri:
        prefix: /
    route:
    - destination:
        # Fully qualified name: a short name would resolve inside istio-system.
        host: centraldashboard.kubeflow.svc.cluster.local
        port:
          number: 80

Network Policies #

yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
  namespace: kubeflow-user-alice
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-istio
  namespace: kubeflow-user-alice
spec:
  podSelector: {}
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          # Label that Kubernetes sets automatically on every namespace.
          kubernetes.io/metadata.name: istio-system
  egress:
  - to:
    - namespaceSelector: {}
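
NetworkPolicies are additive: once a pod is selected by any policy, traffic is denied unless at least one policy allows it. The sketch below models only that rule for ingress with `namespaceSelector.matchLabels` peers; `ingress_allowed` is a hypothetical helper, and the example uses the `kubernetes.io/metadata.name` label that Kubernetes sets on every namespace.

```python
def ingress_allowed(policies, source_ns_labels):
    """Return True if ingress from a namespace with these labels is permitted.

    Simplification: every policy is assumed to select the target pod
    (podSelector: {}), and only namespaceSelector.matchLabels peers are handled.
    """
    selecting = [p for p in policies if "Ingress" in p.get("policyTypes", ["Ingress"])]
    if not selecting:
        return True  # pods not selected by any policy accept all traffic

    for policy in selecting:
        for rule in policy.get("ingress", []):
            if "from" not in rule:
                return True  # a rule without "from" allows every source
            for peer in rule["from"]:
                if "namespaceSelector" not in peer:
                    continue  # podSelector/ipBlock peers are outside this sketch
                wanted = peer["namespaceSelector"].get("matchLabels", {})
                if all(source_ns_labels.get(k) == v for k, v in wanted.items()):
                    return True
    return False


deny_all = {"policyTypes": ["Ingress", "Egress"]}
allow_istio = {
    "ingress": [{"from": [{"namespaceSelector": {
        "matchLabels": {"kubernetes.io/metadata.name": "istio-system"}}}]}],
}
policies = [deny_all, allow_istio]

print(ingress_allowed(policies, {"kubernetes.io/metadata.name": "istio-system"}))       # True
print(ingress_allowed(policies, {"kubernetes.io/metadata.name": "kubeflow-user-bob"}))  # False
```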

Ingress Configuration #

yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: kubeflow-ingress
  namespace: istio-system
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx  # replaces the deprecated kubernetes.io/ingress.class annotation
  tls:
  - hosts:
    - kubeflow.example.com
    secretName: kubeflow-tls
  rules:
  - host: kubeflow.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: istio-ingressgateway
            port:
              number: 80

GPU Scheduling #

GPU Resource Management #

yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: kubeflow-user-alice
spec:
  hard:
    # For extended resources, ResourceQuota accepts only the "requests." prefix.
    requests.nvidia.com/gpu: "10"

GPU Node Selection #

yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: gpu-training
  namespace: kubeflow-user-alice
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 4
      template:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: accelerator
                    operator: In
                    values:
                    - nvidia-tesla-v100
                    - nvidia-tesla-a100
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:latest-gpu
            resources:
              limits:
                nvidia.com/gpu: 2
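
The matching semantics of `nodeSelectorTerms` are worth spelling out: terms are ORed together, while the expressions inside one term are ANDed. The following sketch models that logic for the `In`, `NotIn`, and `Exists` operators only; `matches_term` and `node_is_eligible` are hypothetical helpers, not scheduler code.

```python
def matches_term(term, node_labels):
    """All matchExpressions in one nodeSelectorTerm must hold (logical AND)."""
    for expr in term["matchExpressions"]:
        key, op = expr["key"], expr["operator"]
        value = node_labels.get(key)
        if op == "In":
            if value not in expr["values"]:
                return False
        elif op == "NotIn":
            if value in expr["values"]:
                return False
        elif op == "Exists":
            if key not in node_labels:
                return False
        else:
            raise ValueError(f"operator {op!r} is not covered by this sketch")
    return True


def node_is_eligible(terms, node_labels):
    """nodeSelectorTerms are ORed: one matching term is enough."""
    return any(matches_term(term, node_labels) for term in terms)


# The term used by the TFJob above.
terms = [{"matchExpressions": [{
    "key": "accelerator",
    "operator": "In",
    "values": ["nvidia-tesla-v100", "nvidia-tesla-a100"],
}]}]

print(node_is_eligible(terms, {"accelerator": "nvidia-tesla-v100"}))  # True
print(node_is_eligible(terms, {"accelerator": "nvidia-tesla-t4"}))    # False
print(node_is_eligible(terms, {}))                                    # False
```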

GPU Time-Slicing #

yaml
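# Time-slicing is set in the NVIDIA device plugin configuration file, delivered here as a ConfigMap.
# The namespace and the "any" key are assumptions; match them to your GPU Operator setup.
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-time-slicing
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
      failOnInitError: true
      plugin:
        deviceListStrategy: envvar
        deviceIDStrategy: uuid
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4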
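
With time-slicing, each physical GPU is advertised as `replicas` schedulable `nvidia.com/gpu` resources, which changes what a GPU quota actually means. The arithmetic below is a small illustration; the function names and the 8-GPU node are hypothetical.

```python
def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    """Number of nvidia.com/gpu resources a node advertises with time-slicing."""
    return physical_gpus * replicas


def physical_gpus_covered(quota: int, replicas: int) -> float:
    """Physical-GPU equivalent of a quota counted in time-sliced replicas."""
    return quota / replicas


print(advertised_gpus(8, 4))         # 32
print(physical_gpus_covered(10, 4))  # 2.5
```

In other words, a namespace quota of 10 GPUs would correspond to roughly 2.5 physical GPUs once `replicas: 4` is in effect, and time-sliced replicas share GPU memory and compute without isolation.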

Custom Images #

Building a Custom Notebook Image #

dockerfile
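# Base image from the Kubeflow notebook-servers family; confirm that this registry path and tag exist for your release.
FROM public.ecr.aws/j1r0q0g6/notebooks/notebook-servers/jupyter-tensorflow:v1.8.0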

USER root

RUN apt-get update && apt-get install -y \
    git \
    curl \
    && rm -rf /var/lib/apt/lists/*

RUN pip install --no-cache-dir \
    pandas \
    numpy \
    scikit-learn \
    matplotlib \
    seaborn

COPY custom-scripts/ /opt/custom-scripts/

USER jovyan

Building a Custom Training Image #

dockerfile
FROM tensorflow/tensorflow:2.12.0-gpu

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY src/ /app/src/

ENV PYTHONPATH=/app

ENTRYPOINT ["python", "/app/src/train.py"]

Custom Components #

Custom Pipeline Components #

python
from kfp import dsl
from kfp.dsl import Output, Input, Artifact

@dsl.component(
    base_image='python:3.9',
    packages_to_install=['my-custom-package']
)
def custom_component(
    input_data: Input[Artifact],
    output_data: Output[Artifact],
    param: str = 'default'
):
    from my_custom_package import process
    
    result = process(input_data.path, param)
    result.save(output_data.path)

Custom Katib Algorithms #

python
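import random

# Sketch only: a production Katib algorithm runs as a gRPC suggestion service.
# `experiment` is assumed to follow the Katib Python SDK models (snake_case fields).
def generate_custom_suggestions(experiment):
    params = {}
    for p in experiment.spec.parameters:
        space = p.feasible_space
        if space.list:  # categorical or discrete parameter
            params[p.name] = random.choice(space.list)
        else:           # numeric range; min and max are stored as strings
            params[p.name] = random.uniform(float(space.min), float(space.max))
    return params

def custom_suggestion_algorithm(experiment):
    # One suggestion per parallel trial.
    return [generate_custom_suggestions(experiment)
            for _ in range(experiment.spec.parallel_trial_count)]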

Configuration Management #

ConfigMap Management #

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubeflow-config
  namespace: kubeflow
data:
  default-editor-service-account: default-editor
  default-viewer-service-account: default-viewer
  pod-defaults: |
    {
      "apiVersion": "kubeflow.org/v1alpha1",
      "kind": "PodDefault",
      "metadata": {
        "name": "default-config"
      },
      "spec": {
        "desc": "Default configuration",
        "env": [
          {"name": "LOG_LEVEL", "value": "INFO"}
        ]
      }
    }

Secret Management #

yaml
apiVersion: v1
kind: Secret
metadata:
  name: mlflow-credentials
  namespace: kubeflow-user-alice
type: Opaque
stringData:
  username: mlflow-user
  password: mlflow-password
  tracking-uri: https://mlflow.example.com

Performance Tuning #

Node Tuning #

yaml
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
  labels:
    accelerator: nvidia-tesla-v100
    node.kubernetes.io/instance-type: p3.2xlarge
  annotations:
    node.alpha.kubernetes.io/ttl: "0"

Scheduler Configuration #

yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      enabled:
      - name: NodeResourcesFit
      - name: NodeResourcesBalancedAllocation

Backup and Restore #

Backup Configuration #

yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: kubeflow-backup
  namespace: velero
spec:
  includedNamespaces:
  - kubeflow
  - kubeflow-user-alice
  - istio-system
  excludedResources:
  - events
  - pods
  ttl: 720h
  storageLocation: default
  volumeSnapshotLocations:
  - default
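
The `ttl` field is a Go-style duration, and the backup expires that long after it is created. The sketch below converts such a value into an expiry date; `parse_ttl` is a hypothetical helper that handles only `h`, `m`, and `s` units, and the creation timestamp is made up for the example.

```python
import re
from datetime import datetime, timedelta, timezone

_UNITS = {"h": "hours", "m": "minutes", "s": "seconds"}


def parse_ttl(ttl):
    """Parse a duration made of h/m/s parts, e.g. '720h' or '1h30m'."""
    parts = re.findall(r"(\d+)([hms])", ttl)
    # Reject input with anything other than number+unit pairs.
    if not parts or "".join(n + u for n, u in parts) != ttl:
        raise ValueError(f"unsupported duration: {ttl!r}")
    total = timedelta()
    for number, unit in parts:
        total += timedelta(**{_UNITS[unit]: int(number)})
    return total


created = datetime(2026, 4, 5, tzinfo=timezone.utc)  # assumed creation time
expires = created + parse_ttl("720h")

print(parse_ttl("720h").days)  # 30
print(expires.date())          # 2026-05-05
```

A `ttl` of `720h` therefore keeps each backup for 30 days.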

Restore Configuration #

yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: kubeflow-restore
  namespace: velero
spec:
  backupName: kubeflow-backup
  includedNamespaces:
  - kubeflow
  - kubeflow-user-alice

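Best Practices #

Configuration Management #

text
1. Version control
   ├── Manage YAML configuration in Git
   ├── Use Kustomize to manage variants
   └── Record the change history

2. Environment isolation
   ├── Separate development, test, and production environments
   ├── Use different namespaces
   └── Configure different resource quotas

3. Security configuration
   ├── Follow the principle of least privilege
   ├── Audit permissions regularly
   └── Store sensitive data in Secrets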

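Performance Optimization #

text
1. Resource optimization
   ├── Set resource requests sensibly
   ├── Use LimitRange to bound requests and limits
   └── Monitor resource usage

2. Scheduling optimization
   ├── Use node affinity
   ├── Configure priorities
   └── Optimize GPU scheduling

3. Network optimization
   ├── Configure network policies
   ├── Tune the Istio configuration
   └── Use caching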

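Next Steps #

Now that you have covered advanced configuration, continue with Security Configuration to learn Kubeflow's security best practices.

Last updated: 2026-04-05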