Advanced Configuration
Overview
This chapter covers Kubeflow's advanced configuration options for managing and tuning the platform in an enterprise environment.
Topics Covered
text
Advanced configuration topics
├── Multi-tenancy management
│   ├── Profile management
│   ├── Namespace isolation
│   ├── Resource quotas
│   └── Access control
├── Resource management
│   ├── Resource quota configuration
│   ├── Priority and preemption
│   ├── Node scheduling
│   └── GPU scheduling
├── Network configuration
│   ├── Istio service mesh
│   ├── Network policies
│   ├── Ingress configuration
│   └── DNS configuration
└── Custom components
    ├── Custom images
    ├── Custom components
    ├── Plugin development
    └── Integrations and extensions
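Multi-Tenancy Management
Profile Overview
A Profile is the core concept in Kubeflow multi-tenancy: each user or team gets one Profile, and the profile controller creates a namespace with the same name for it.
text
What a Profile does:
├── Creates the user's namespace
├── Configures resource quotas
├── Sets up permissions
├── Configures plugins
└── Manages the namespace lifecycle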
Creating a Profile
yaml
apiVersion: kubeflow.org/v1beta1
kind: Profile
metadata:
  name: kubeflow-user-alice   # also the name of the namespace that gets created
spec:
  owner:
    kind: User
    name: alice@example.com
  plugins:
    # GKE-specific plugin (Workload Identity)
    - kind: WorkloadIdentity
      spec:
        gcpServiceAccount: alice-sa@project.iam.gserviceaccount.com
  resourceQuotaSpec:
    hard:
      cpu: "100"
      memory: 200Gi
      persistentvolumeclaims: "20"
      # Extended resources such as GPUs only accept the "requests." prefix in a quota
      requests.nvidia.com/gpu: "10"
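When Profiles have to be created for many users, generating the manifest from code keeps names and quotas consistent. The following is a minimal sketch and not part of Kubeflow: `build_profile` is a hypothetical helper, and the `kubeflow-user-<name>` naming convention is simply taken from the example above. It prints JSON, which `kubectl apply -f -` accepts as well as YAML.

```python
import json
import string

_ALLOWED = set(string.ascii_lowercase + string.digits)


def build_profile(owner_email: str, hard_quota: dict) -> dict:
    """Build a Profile manifest for one user (hypothetical helper)."""
    local_part = owner_email.split("@", 1)[0].lower()
    # Namespace names must be DNS labels, so map any other character to '-'.
    safe_name = "".join(c if c in _ALLOWED else "-" for c in local_part).strip("-")
    return {
        "apiVersion": "kubeflow.org/v1beta1",
        "kind": "Profile",
        "metadata": {"name": f"kubeflow-user-{safe_name}"},
        "spec": {
            "owner": {"kind": "User", "name": owner_email},
            "resourceQuotaSpec": {"hard": dict(hard_quota)},
        },
    }


if __name__ == "__main__":
    profile = build_profile("alice@example.com", {"cpu": "100", "memory": "200Gi"})
    print(json.dumps(profile, indent=2))
```

Saved under an assumed file name such as `build_profile.py`, the output could be piped straight into the cluster with `python build_profile.py | kubectl apply -f -`.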
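Profile Configuration in Detail
yaml
apiVersion: kubeflow.org/v1beta1
kind: Profile
metadata:
  name: team-data-science
spec:
  owner:
    kind: User
    name: team-lead@example.com
  plugins:
    - kind: WorkloadIdentity
      spec:
        gcpServiceAccount: team-sa@project.iam.gserviceaccount.com
  resourceQuotaSpec:
    hard:
      cpu: "200"
      memory: 400Gi
      requests.nvidia.com/gpu: "20"   # GPU quotas must use the "requests." prefix
      persistentvolumeclaims: "50"
      pods: "100"
      services: "50"
      secrets: "100"
      configmaps: "100"
  # NOTE: the upstream Profile spec defines only owner, plugins and
  # resourceQuotaSpec. The block below is not part of that schema and is
  # kept commented out for reference.
  # defaultPodSecurityContext:
  #   runAsNonRoot: true
  #   runAsUser: 1000
  #   fsGroup: 1000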
Managing User Permissions
yaml
# alice gets edit access to her own namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: user-edit
  namespace: kubeflow-user-alice
subjects:
  - kind: User
    name: alice@example.com
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: kubeflow-edit
  apiGroup: rbac.authorization.k8s.io
---
# bob gets read-only access to alice's namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: user-view
  namespace: kubeflow-user-alice
subjects:
  - kind: User
    name: bob@example.com
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: kubeflow-view
  apiGroup: rbac.authorization.k8s.io
Resource Quota Management
Namespace Quotas
yaml
# Compute resources
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
  namespace: kubeflow-user-alice
spec:
  hard:
    requests.cpu: "50"
    requests.memory: 100Gi
    limits.cpu: "100"
    limits.memory: 200Gi
    requests.nvidia.com/gpu: "10"
    pods: "50"
    persistentvolumeclaims: "20"
---
# Object counts
apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-counts
  namespace: kubeflow-user-alice
spec:
  hard:
    configmaps: "50"
    secrets: "50"
    services: "20"
    replicationcontrollers: "10"
    count/jobs.batch: "20"
    count/cronjobs.batch: "10"
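A quota is enforced at admission time: a new Pod is rejected when its requests, added to what the namespace already uses, would exceed a `hard` value. The sketch below mimics that arithmetic for illustration only. `parse_quantity` and `exceeded_quota_keys` are hypothetical helpers, and the parser handles only the common suffixes (`m`, `k`/`M`/`G`/`T`, `Ki`/`Mi`/`Gi`/`Ti`), not the full Kubernetes quantity grammar.

```python
from decimal import Decimal

# Multipliers for the quantity suffixes this sketch understands.
_SUFFIXES = {
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30, "Ti": Decimal(2) ** 40,
    "k": Decimal(10) ** 3, "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9, "T": Decimal(10) ** 12,
    "m": Decimal("0.001"),
}


def parse_quantity(text: str) -> Decimal:
    """Convert a quantity such as '100m' or '200Gi' to a plain number."""
    for suffix, factor in _SUFFIXES.items():
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * factor
    return Decimal(text)


def exceeded_quota_keys(hard: dict, used: dict, requested: dict) -> list:
    """Return the quota keys that the new request would push over the limit."""
    exceeded = []
    for key, amount in requested.items():
        if key not in hard:
            continue  # resources without a quota entry are not restricted
        total = parse_quantity(used.get(key, "0")) + parse_quantity(amount)
        if total > parse_quantity(hard[key]):
            exceeded.append(key)
    return exceeded


if __name__ == "__main__":
    hard = {"requests.cpu": "50", "requests.memory": "100Gi"}
    used = {"requests.cpu": "49500m", "requests.memory": "10Gi"}
    new_pod = {"requests.cpu": "1", "requests.memory": "1Gi"}
    print(exceeded_quota_keys(hard, used, new_pod))  # ['requests.cpu']
```

In the example run, 49.5 CPUs are already requested, so one more CPU exceeds the limit of 50 while memory still fits.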
LimitRange Configuration
yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: resource-limits
  namespace: kubeflow-user-alice
spec:
  limits:
    - type: Container
      default:            # limits applied when a container sets none
        cpu: "2"
        memory: "4Gi"
      defaultRequest:     # requests applied when a container sets none
        cpu: "100m"
        memory: "256Mi"
      max:
        cpu: "16"
        memory: "64Gi"
      min:
        cpu: "50m"
        memory: "64Mi"
    - type: PersistentVolumeClaim
      max:
        storage: "500Gi"
      min:
        storage: "1Gi"
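To make the effect of `default` and `defaultRequest` concrete, the sketch below fills in missing container resources the way a LimitRange does. It is a simplified model under stated assumptions, not the admission plugin itself: `apply_defaults` is a hypothetical helper, only `cpu` and `memory` are considered, and `min`/`max` validation is left out.

```python
import copy


def apply_defaults(container: dict, limit_item: dict) -> dict:
    """Return a copy of `container` with LimitRange defaults filled in."""
    result = copy.deepcopy(container)
    resources = result.setdefault("resources", {})
    requests = resources.setdefault("requests", {})
    limits = resources.setdefault("limits", {})

    for name in ("cpu", "memory"):
        had_limit = name in limits
        if not had_limit and name in limit_item.get("default", {}):
            limits[name] = limit_item["default"][name]
        if name not in requests:
            if had_limit:
                # An omitted request inherits an explicitly set limit.
                requests[name] = limits[name]
            elif name in limit_item.get("defaultRequest", {}):
                requests[name] = limit_item["defaultRequest"][name]
    return result


if __name__ == "__main__":
    container_limits = {
        "default": {"cpu": "2", "memory": "4Gi"},
        "defaultRequest": {"cpu": "100m", "memory": "256Mi"},
    }
    print(apply_defaults({"name": "notebook"}, container_limits))
```

A container that declares nothing ends up with the default limits and default requests, while a container that sets only a limit gets a request equal to that limit.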
Priority and Preemption
yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "High-priority training jobs"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
globalDefault: true   # applies to every Pod that sets no priorityClassName
description: "Low-priority development jobs"
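The interplay between `value` and `globalDefault` can be sketched in a few lines. This is an illustrative model with hypothetical helper names, not scheduler code: a Pod takes the value of the class it names, falls back to the class marked `globalDefault`, and otherwise gets priority 0. Only Pods with a strictly lower priority are candidates for preemption by a pending Pod.

```python
def effective_priority(priority_class_name, priority_classes):
    """Resolve the priority a Pod would get from its priorityClassName."""
    if priority_class_name is not None:
        return priority_classes[priority_class_name]["value"]
    for spec in priority_classes.values():
        if spec.get("globalDefault", False):
            return spec["value"]
    return 0  # no class named and no global default defined


def preemption_candidates(pending_priority, running_pods):
    """Names of running Pods whose priority is lower than the pending Pod's."""
    return [name for name, prio in running_pods.items() if prio < pending_priority]


if __name__ == "__main__":
    classes = {
        "high-priority": {"value": 1000000, "globalDefault": False},
        "low-priority": {"value": 1000, "globalDefault": True},
    }
    training = effective_priority("high-priority", classes)
    notebook = effective_priority(None, classes)
    print(preemption_candidates(training, {"notebook-0": notebook}))
```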
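Network Configuration
Istio Service Mesh
yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: kubeflow-dashboard
  # Created next to the gateway and the dashboard Service, which live in the
  # kubeflow namespace in the upstream manifests
  namespace: kubeflow
spec:
  gateways:
    - kubeflow-gateway
  hosts:
    - "*"
  http:
    - match:
        - uri:
            prefix: /
      route:
        - destination:
            # Fully qualified name, so resolution does not depend on the
            # namespace of this VirtualService
            host: centraldashboard.kubeflow.svc.cluster.local
            port:
              number: 80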
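Network Policies
yaml
# Default deny: blocks all ingress and egress for every Pod in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
  namespace: kubeflow-user-alice
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Re-allow ingress from istio-system and egress to any namespace in the cluster
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-istio
  namespace: kubeflow-user-alice
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # Label that Kubernetes sets automatically on every namespace
              kubernetes.io/metadata.name: istio-system
  egress:
    - to:
        - namespaceSelector: {}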
Ingress Configuration
yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: kubeflow-ingress
  namespace: istio-system
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  # Replaces the deprecated kubernetes.io/ingress.class annotation
  ingressClassName: nginx
  tls:
    - hosts:
        - kubeflow.example.com
      secretName: kubeflow-tls
  rules:
    - host: kubeflow.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: istio-ingressgateway
                port:
                  number: 80
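GPU Scheduling
GPU Resource Management
yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: kubeflow-user-alice
spec:
  hard:
    # Extended resources cannot be overcommitted, so a quota accepts only the
    # "requests." prefix for them; a "limits.nvidia.com/gpu" entry is rejected
    requests.nvidia.com/gpu: "10"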
GPU Node Selection
yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: gpu-training
  namespace: kubeflow-user-alice
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 4
      template:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      # "accelerator" is a custom node label; it must be set on the GPU nodes
                      - key: accelerator
                        operator: In
                        values:
                          - nvidia-tesla-v100
                          - nvidia-tesla-a100
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:latest-gpu
              resources:
                limits:
                  nvidia.com/gpu: 2   # GPUs per worker
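GPU Time-Slicing
yaml
# Time-slicing is set in the NVIDIA k8s-device-plugin configuration file,
# which is shipped in a ConfigMap rather than as a dedicated resource kind.
# The ConfigMap name, namespace and data key below are examples; adjust them
# to your device plugin or GPU Operator installation.
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-time-slicing
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
      failOnInitError: true
      plugin:
        deviceListStrategy: envvar
        deviceIDStrategy: uuid
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # each physical GPU is advertised as 4 schedulable GPUs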
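With time-slicing, the `replicas` value multiplies the number of `nvidia.com/gpu` resources a node advertises; the replicas share the same physical GPU without memory isolation. The arithmetic is sketched below for capacity planning. Both function names and the node size are hypothetical and chosen only for illustration.

```python
def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    """Schedulable nvidia.com/gpu count a node reports with time-slicing."""
    if physical_gpus < 0 or replicas < 1:
        raise ValueError("physical_gpus must be >= 0 and replicas >= 1")
    return physical_gpus * replicas


def max_pods(physical_gpus: int, replicas: int, gpus_per_pod: int) -> int:
    """How many Pods requesting `gpus_per_pod` GPUs fit on the node."""
    return advertised_gpus(physical_gpus, replicas) // gpus_per_pod


if __name__ == "__main__":
    # Assumed example node: 8 physical GPUs, replicas: 4 as configured above.
    print(advertised_gpus(8, 4))  # 32
    print(max_pods(8, 4, 2))      # 16
```

Because the slices share one device, a Pod that requests two time-sliced GPUs is not guaranteed more compute than a Pod that requests one.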
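Custom Images
Building a Custom Notebook Image
dockerfile
# Kubeflow notebook images from release 1.7 onward are published under
# kubeflownotebookswg on Docker Hub
FROM kubeflownotebookswg/jupyter-tensorflow:v1.8.0
# System packages need root
USER root
RUN apt-get update && apt-get install -y --no-install-recommends \
        git \
        curl \
    && rm -rf /var/lib/apt/lists/*
COPY custom-scripts/ /opt/custom-scripts/
# Switch back to the notebook user before installing Python packages so the
# files in the Python environment stay owned by that user
USER jovyan
RUN pip install --no-cache-dir \
        pandas \
        numpy \
        scikit-learn \
        matplotlib \
        seaborn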
Building a Custom Training Image
dockerfile
FROM tensorflow/tensorflow:2.12.0-gpu
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY src/ /app/src/
ENV PYTHONPATH=/app
ENTRYPOINT ["python", "/app/src/train.py"]
Custom Components
Custom Pipeline Component
python
from kfp import dsl
from kfp.dsl import Artifact, Input, Output


@dsl.component(
    base_image='python:3.9',
    packages_to_install=['my-custom-package'],  # placeholder package name
)
def custom_component(
    input_data: Input[Artifact],
    output_data: Output[Artifact],
    param: str = 'default',
):
    # The function must be self-contained, so its imports live inside the body
    from my_custom_package import process

    result = process(input_data.path, param)
    result.save(output_data.path)
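Custom Katib Algorithm
python
import random

# Katib runs algorithms as gRPC Suggestion services; this shows only the
# sampling logic such a service could wrap.
def generate_custom_suggestions(experiment):
    # Illustrative strategy: sample each numeric parameter uniformly at random.
    return {
        p.name: random.uniform(float(p.feasible_space.min), float(p.feasible_space.max))
        for p in experiment.spec.parameters
    }

def custom_suggestion_algorithm(experiment):
    # `experiment` is assumed to be a V1beta1Experiment from the Katib Python SDK.
    # Returns one parameter assignment per parallel trial.
    count = experiment.spec.parallel_trial_count
    return [generate_custom_suggestions(experiment) for _ in range(count)]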
Configuration Management
ConfigMap Management
yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubeflow-config
  namespace: kubeflow
data:
  default-editor-service-account: default-editor
  default-viewer-service-account: default-viewer
  pod-defaults: |
    {
      "apiVersion": "kubeflow.org/v1alpha1",
      "kind": "PodDefault",
      "metadata": {
        "name": "default-config"
      },
      "spec": {
        "desc": "Default configuration",
        "env": [
          {"name": "LOG_LEVEL", "value": "INFO"}
        ]
      }
    }
Secret Management
yaml
apiVersion: v1
kind: Secret
metadata:
  name: mlflow-credentials
  namespace: kubeflow-user-alice
type: Opaque
# Placeholder values; never commit real credentials to version control
stringData:
  username: mlflow-user
  password: mlflow-password
  tracking-uri: https://mlflow.example.com
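`stringData` is a write-only convenience field: the API server stores the values base64-encoded under `data`. The sketch below performs the same conversion, which can help when comparing a manifest with the output of `kubectl get secret -o yaml`. `to_secret_data` is a hypothetical helper name. Note that base64 is an encoding, not encryption.

```python
import base64


def to_secret_data(string_data: dict) -> dict:
    """Encode plain-text Secret values the way they appear under `data`."""
    return {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in string_data.items()
    }


if __name__ == "__main__":
    encoded = to_secret_data({"username": "mlflow-user"})
    print(encoded)
    # Decoding returns the original value, so anyone who can read the Secret
    # can recover it.
    print(base64.b64decode(encoded["username"]).decode("utf-8"))
```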
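Performance Tuning
Node Tuning
yaml
# Labels are usually added to an existing node, for example with
# `kubectl label node gpu-node-1 accelerator=nvidia-tesla-v100`
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
  labels:
    accelerator: nvidia-tesla-v100
    node.kubernetes.io/instance-type: p3.2xlarge
  annotations:
    node.alpha.kubernetes.io/ttl: "0"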
Scheduler Configuration
yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        enabled:
          - name: NodeResourcesFit
          - name: NodeResourcesBalancedAllocation
Backup and Restore
Backup Configuration
yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: kubeflow-backup
  namespace: velero
spec:
  includedNamespaces:
    - kubeflow
    - kubeflow-user-alice
    - istio-system
  excludedResources:
    - events
    - pods
  ttl: 720h   # keep the backup for 30 days
  storageLocation: default
  volumeSnapshotLocations:
    - default
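The `ttl` field is a duration string, and `720h` corresponds to 30 days. For retention planning, the expiry time of a backup can be computed as sketched below. `parse_ttl` and `expires_at` are hypothetical helpers that accept only whole `h`, `m` and `s` parts, which is a subset of the duration syntax.

```python
import re
from datetime import datetime, timedelta, timezone

_UNITS = {"h": "hours", "m": "minutes", "s": "seconds"}


def parse_ttl(ttl: str) -> timedelta:
    """Parse a duration such as '720h' or '1h30m' into a timedelta."""
    if not re.fullmatch(r"(\d+[hms])+", ttl):
        raise ValueError(f"unsupported duration: {ttl!r}")
    total = timedelta()
    for number, unit in re.findall(r"(\d+)([hms])", ttl):
        total += timedelta(**{_UNITS[unit]: int(number)})
    return total


def expires_at(created: datetime, ttl: str) -> datetime:
    """Time after which a backup created at `created` is eligible for deletion."""
    return created + parse_ttl(ttl)


if __name__ == "__main__":
    created = datetime(2026, 4, 5, tzinfo=timezone.utc)
    print(parse_ttl("720h").days)       # 30
    print(expires_at(created, "720h"))  # 2026-05-05 00:00:00+00:00
```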
Restore Configuration
yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: kubeflow-restore
  namespace: velero
spec:
  backupName: kubeflow-backup
  includedNamespaces:
    - kubeflow
    - kubeflow-user-alice
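Best Practices
Configuration Management
text
1. Version control
├── Keep YAML configuration in Git
├── Manage variants with Kustomize
└── Record the change history
2. Environment isolation
├── Separate development, test, and production environments
├── Use different namespaces
└── Set different resource quotas
3. Security
├── Follow the principle of least privilege
├── Audit permissions regularly
└── Store sensitive data in Secrets
Performance Optimization
text
1. Resources
├── Set realistic resource requests
├── Bound them with a LimitRange
└── Monitor resource usage
2. Scheduling
├── Use node affinity
├── Configure priorities
└── Tune GPU scheduling
3. Networking
├── Configure network policies
├── Tune the Istio configuration
└── Use caching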
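Next Steps
Now that you have worked through advanced configuration, continue with Security Configuration to learn Kubeflow's security best practices.
Last updated: 2026-04-05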