Katib Hyperparameter Tuning #
Overview #
Katib is Kubeflow's automated machine learning (AutoML) component, providing hyperparameter optimization, neural architecture search (NAS), and model compression.
Core Features #
text
┌─────────────────────────────────────────────────────────────┐
│                     Katib Core Features                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Hyperparameter optimization:                               │
│  ├── Automatic search for optimal hyperparameters           │
│  ├── Multiple optimization algorithms                       │
│  ├── Parallel trial execution                               │
│  └── Early stopping                                         │
│                                                             │
│  Neural architecture search:                                │
│  ├── Automatic search for network architectures             │
│  ├── Multiple NAS algorithms                                │
│  └── Architecture evaluation                                │
│                                                             │
│  Model compression:                                         │
│  ├── Pruning                                                │
│  ├── Quantization                                           │
│  └── Knowledge distillation                                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Core Concepts #
Experiment #
An Experiment defines the overall configuration of an optimization task.
text
Experiment structure:
├── Objective - the optimization target
│   ├── type: maximize/minimize
│   ├── goal: target value
│   └── objectiveMetricName: metric to optimize
│
├── Algorithm - the optimization algorithm
│   ├── algorithmName: algorithm type
│   └── algorithmSettings: algorithm parameters
│
├── Parameters - the search space
│   ├── parameterType: parameter type
│   └── feasibleSpace: allowed values
│
└── Trial Template - the trial template
    ├── trialSpec: workload definition
    └── trialParameters: parameter substitution
Trial #
A Trial is a single execution of an Experiment with one specific combination of hyperparameters.
text
Trial lifecycle:
├── Created - trial created
├── Running - in progress
├── Succeeded - completed successfully
├── Failed - failed
└── EarlyStopped - stopped early
Suggestion #
A Suggestion is a set of candidate hyperparameter values produced by the optimization algorithm.
text
Suggestion workflow:
├── The Experiment creates a Suggestion
├── The Suggestion invokes the optimization algorithm
├── The algorithm generates hyperparameter candidates
├── Trials are created to run the candidates
└── Results are collected and the loop continues
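The loop can be sketched in a few lines of plain Python (an illustrative stand-in, not the Katib API; the search space mirrors the random-search example below):

```python
import random

# Hypothetical search space, mirroring the Experiment examples below.
SEARCH_SPACE = {
    "learning_rate": ("double", 0.001, 0.1),
    "batch_size": ("int", 16, 128),
}

def suggest(space):
    """Play the role of a Suggestion: sample one candidate per parameter."""
    params = {}
    for name, (ptype, lo, hi) in space.items():
        if ptype == "int":
            params[name] = random.randint(lo, hi)
        else:
            params[name] = random.uniform(lo, hi)
    return params

def run_trial(params):
    """Stand-in for a Trial: return a fake objective value."""
    return 1.0 - abs(params["learning_rate"] - 0.01)  # pretend lr=0.01 is best

best_params, best_metric = None, float("-inf")
for _ in range(12):                    # maxTrialCount
    candidate = suggest(SEARCH_SPACE)  # Suggestion step
    metric = run_trial(candidate)      # Trial step
    if metric > best_metric:           # objective: maximize
        best_params, best_metric = candidate, metric
print(best_params, best_metric)
```

In Katib the same loop runs as Kubernetes controllers reconciling Experiment, Suggestion, and Trial custom resources.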
Supported Optimization Algorithms #
Algorithm Overview #
text
┌─────────────────────────────────────────────────────────────┐
│                Algorithms Supported by Katib                │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Black-box optimization:                                    │
│  ├── Random Search                                          │
│  ├── Grid Search                                            │
│  └── Bayesian Optimization                                  │
│                                                             │
│  Multi-fidelity algorithms:                                 │
│  ├── Hyperband                                              │
│  └── Successive Halving                                     │
│                                                             │
│  Neural architecture search:                                │
│  ├── ENAS - Efficient Neural Architecture Search            │
│  └── DARTS - Differentiable Architecture Search             │
│                                                             │
│  Others:                                                    │
│  ├── TPE - Tree-structured Parzen Estimator                 │
│  ├── CMA-ES - Covariance Matrix Adaptation ES               │
│  └── Sobol - Sobol sequences                                │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Random Search #
Random search is the simplest algorithm: it samples points uniformly at random from the search space.
yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-search
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.001"
        max: "0.1"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "16"
        max: "128"
Grid Search #
Grid search exhaustively evaluates every combination in the search space.
yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: grid-search
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: grid
  parameters:
    # grid search enumerates explicit values, so use the discrete type
    - name: learning_rate
      parameterType: discrete
      feasibleSpace:
        list:
          - "0.001"
          - "0.01"
          - "0.1"
    - name: batch_size
      parameterType: discrete
      feasibleSpace:
        list:
          - "16"
          - "32"
          - "64"
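With three candidate values per parameter, this experiment runs every combination, 3 × 3 = 9 trials in total. The equivalent enumeration in Python:

```python
from itertools import product

learning_rates = ["0.001", "0.01", "0.1"]
batch_sizes = ["16", "32", "64"]

# Every (learning_rate, batch_size) pair becomes one Trial.
grid = list(product(learning_rates, batch_sizes))
print(len(grid))  # 9
for lr, bs in grid:
    print(f"trial: learning_rate={lr} batch_size={bs}")
```

Because trial count is the product of all list sizes, grid search quickly becomes impractical beyond a handful of parameters.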
Bayesian Optimization #
Bayesian optimization models the objective function with a Gaussian process and uses that model to choose promising candidates.
yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: bayesian-optimization
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: maximize
    goal: 0.98
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: bayesianoptimization
    algorithmSettings:
      - name: random_state
        value: "42"
      - name: n_initial_points
        value: "5"
  parallelTrialCount: 2
  maxTrialCount: 20
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.0001"
        max: "0.1"
    - name: hidden_units
      parameterType: int
      feasibleSpace:
        min: "32"
        max: "512"
Hyperband #
Hyperband is an efficient multi-fidelity algorithm: it starts many configurations on a small resource budget and repeatedly promotes only the best-performing fraction to larger budgets.
yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: hyperband
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: hyperband
    algorithmSettings:
      - name: eta
        value: "3"
      - name: resource_name
        value: "epochs"
      - name: resource_type
        value: "int"
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.001"
        max: "0.1"
    # Hyperband needs the budget resource (here: epochs) in the search space
    - name: epochs
      parameterType: int
      feasibleSpace:
        min: "1"
        max: "27"
  trialTemplate:
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training
                image: python:3.9
                command:
                  - python
                  - -c
                  - |
                    import sys
                    learning_rate = float(sys.argv[1])
                    epochs = int(sys.argv[2])
                    print(f"Training with lr={learning_rate}, epochs={epochs}")
                    print(f"accuracy={0.8 + 0.1 * (1 - learning_rate)}")
                  - "${trialParameters.learning_rate}"
                  - "${trialParameters.epochs}"
            restartPolicy: Never
    trialParameters:
      - name: learning_rate
        reference: learning_rate
      - name: epochs
        reference: epochs
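With eta=3, each elimination round keeps roughly the top third of configurations and triples their training budget. The successive-halving schedule that Hyperband runs inside each bracket can be sketched with simple arithmetic (illustrative only, not Katib's scheduler):

```python
def successive_halving(n_configs, min_budget, eta=3):
    """Return the (configs, per-config budget) schedule for one bracket."""
    rounds = []
    budget = min_budget
    while n_configs >= 1:
        rounds.append((n_configs, budget))
        if n_configs == 1:
            break
        n_configs = max(1, n_configs // eta)  # keep the top 1/eta
        budget *= eta                          # give survivors eta x budget
    return rounds

# 9 configs at 1 epoch -> 3 configs at 3 epochs -> 1 config at 9 epochs
print(successive_halving(9, 1))
```

The total cost per round stays roughly constant (9×1, 3×3, 1×9 epochs), which is why Hyperband can explore many more configurations than full-budget search for the same compute.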
TPE (Tree-structured Parzen Estimator) #
TPE is an efficient Bayesian optimization variant that models the good and bad regions of the search space with separate density estimators.
yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: tpe-optimization
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: minimize
    objectiveMetricName: loss
  algorithm:
    algorithmName: tpe
    algorithmSettings:
      - name: n_startup_trials
        value: "10"
      - name: n_ei_candidates
        value: "24"
  parallelTrialCount: 2
  maxTrialCount: 30
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.0001"
        max: "0.1"
    - name: weight_decay
      parameterType: double
      feasibleSpace:
        min: "0.00001"
        max: "0.01"
Creating a Hyperparameter Tuning Experiment #
Basic Experiment Configuration #
yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: mnist-tuning
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
    additionalMetricNames:
      - loss
  algorithm:
    algorithmName: random
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.001"
        max: "0.1"
        step: "0.001"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "16"
        max: "128"
    - name: hidden_units
      parameterType: categorical
      feasibleSpace:
        list:
          - "64"
          - "128"
          - "256"
  trialTemplate:
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training
                image: tensorflow/tensorflow:2.12.0
                command:
                  - python
                  - -c
                  - |
                    import tensorflow as tf
                    learning_rate = float("${trialParameters.learning_rate}")
                    batch_size = int("${trialParameters.batch_size}")
                    hidden_units = int("${trialParameters.hidden_units}")
                    print(f"Training with lr={learning_rate}, batch={batch_size}, units={hidden_units}")
                    mnist = tf.keras.datasets.mnist
                    (x_train, y_train), (x_test, y_test) = mnist.load_data()
                    x_train, x_test = x_train / 255.0, x_test / 255.0
                    model = tf.keras.models.Sequential([
                        tf.keras.layers.Flatten(input_shape=(28, 28)),
                        tf.keras.layers.Dense(hidden_units, activation='relu'),
                        tf.keras.layers.Dropout(0.2),
                        tf.keras.layers.Dense(10)
                    ])
                    optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
                    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
                    model.compile(optimizer=optimizer, loss=loss_fn, metrics=['accuracy'])
                    model.fit(x_train, y_train, epochs=5, batch_size=batch_size, verbose=0)
                    loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
                    print(f"loss={loss}")
                    print(f"accuracy={accuracy}")
            restartPolicy: Never
    trialParameters:
      - name: learning_rate
        reference: learning_rate
      - name: batch_size
        reference: batch_size
      - name: hidden_units
        reference: hidden_units
Using a TFJob as the Trial #
yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: tfjob-tuning
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: maximize
    goal: 0.98
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: bayesianoptimization
  parallelTrialCount: 2
  maxTrialCount: 10
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.0001"
        max: "0.01"
  trialTemplate:
    primaryContainerName: tensorflow
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: TFJob
      spec:
        tfReplicaSpecs:
          Chief:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                  - name: tensorflow
                    image: tensorflow/tensorflow:2.12.0
                    command:
                      - python
                      - /opt/model/train.py
                      - --learning-rate=${trialParameters.learning_rate}
                    resources:
                      limits:
                        nvidia.com/gpu: 1
    trialParameters:
      - name: learning_rate
        reference: learning_rate
Using a PyTorchJob as the Trial #
yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: pytorchjob-tuning
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: minimize
    objectiveMetricName: loss
  algorithm:
    algorithmName: hyperband
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.001"
        max: "0.1"
  trialTemplate:
    primaryContainerName: pytorch
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: PyTorchJob
      spec:
        pytorchReplicaSpecs:
          Master:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                  - name: pytorch
                    image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
                    command:
                      - python
                      - /opt/model/train.py
                      - --lr=${trialParameters.lr}
    trialParameters:
      - name: lr
        reference: lr
Parameter Types #
Numeric Parameters #
yaml
parameters:
  # Floating-point parameter
  - name: learning_rate
    parameterType: double
    feasibleSpace:
      min: "0.0001"
      max: "0.1"
      step: "0.0001"  # optional step size
  # Integer parameter
  - name: batch_size
    parameterType: int
    feasibleSpace:
      min: "16"
      max: "256"
      step: "16"  # optional step size
Categorical Parameters #
yaml
parameters:
  # Categorical parameter
  - name: optimizer
    parameterType: categorical
    feasibleSpace:
      list:
        - "adam"
        - "sgd"
        - "rmsprop"
  # Discrete parameter
  - name: hidden_units
    parameterType: discrete
    feasibleSpace:
      list:
        - "64"
        - "128"
        - "256"
        - "512"
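How the four types behave can be illustrated with a small sampler (an illustrative helper, not part of the Katib API; Katib's algorithms do the equivalent internally):

```python
import random

def sample(parameter_type, feasible_space):
    """Draw one value from a Katib-style feasibleSpace (illustrative)."""
    if parameter_type in ("categorical", "discrete"):
        return random.choice(feasible_space["list"])
    lo, hi = float(feasible_space["min"]), float(feasible_space["max"])
    step = feasible_space.get("step")
    if parameter_type == "int":
        if step:
            return random.randrange(int(lo), int(hi) + 1, int(step))
        return random.randint(int(lo), int(hi))
    # double: continuous range, optionally discretized by step
    if step:
        n_steps = int((hi - lo) / float(step))
        return lo + random.randint(0, n_steps) * float(step)
    return random.uniform(lo, hi)

print(sample("categorical", {"list": ["adam", "sgd", "rmsprop"]}))
print(sample("int", {"min": "16", "max": "256", "step": "16"}))
print(sample("double", {"min": "0.0001", "max": "0.1"}))
```

Note that `discrete` values are numeric strings while `categorical` values are arbitrary strings; both are substituted into the trial template as text.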
Early Stopping #
Configuring Early Stopping #
yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: early-stopping-example
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  earlyStopping:
    algorithmName: medianstop
    algorithmSettings:
      - name: min_trials_required
        value: "3"
      - name: start_step
        value: "5"
  parallelTrialCount: 3
  maxTrialCount: 20
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.001"
        max: "0.1"
Supported Early Stopping Algorithms #
text
Early stopping algorithms:
├── medianstop - median stopping rule
├── enas - ENAS early stopping
└── hyperband - Hyperband early stopping
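Of these, medianstop is the most commonly used: a running trial is stopped when its best metric so far falls below the median of other trials' metrics at the same training step. A minimal sketch of the rule (illustrative, not Katib's implementation):

```python
from statistics import median

def should_stop(current_best, peer_metrics_at_step, min_trials_required=3):
    """Median stopping rule: stop a running trial if its best metric so far
    is below the median of peer trials' metrics at the same step."""
    if len(peer_metrics_at_step) < min_trials_required:
        return False  # not enough evidence yet
    return current_best < median(peer_metrics_at_step)

print(should_stop(0.70, [0.85, 0.90, 0.88]))  # True: 0.70 < median 0.88
print(should_stop(0.92, [0.85, 0.90, 0.88]))  # False
print(should_stop(0.10, [0.9]))               # False: too few peers
```

The `min_trials_required` and `start_step` settings in the example above control exactly these two guards: how many peers are needed, and how many steps a trial runs before it can be judged.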
Metrics Collection #
File Metrics Collector #
yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: file-metrics
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    collector:
      kind: File
    source:
      fileSystemPath:
        path: /output/metrics.txt
        kind: File
        format: TEXT
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.001"
        max: "0.1"
  trialTemplate:
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training
                image: python:3.9
                command:
                  - sh
                  - -c
                  - |
                    python train.py --lr ${trialParameters.lr}
                    echo "accuracy=0.95" > /output/metrics.txt
            restartPolicy: Never
    trialParameters:
      - name: lr
        reference: lr
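The File collector parses lines in the `name=value` form shown above (`accuracy=0.95`). A training script can emit any number of metrics that way; a small helper sketch:

```python
def write_metrics(path, metrics):
    """Append metrics as the `name=value` lines the File collector parses."""
    with open(path, "a") as f:
        for name, value in metrics.items():
            f.write(f"{name}={value}\n")

# In a real trial this path would be the collector's configured
# fileSystemPath (e.g. /output/metrics.txt).
write_metrics("/tmp/metrics.txt", {"accuracy": 0.95, "loss": 0.12})
print(open("/tmp/metrics.txt").read())
```

The metric name on the left must match `objectiveMetricName` (or an entry in `additionalMetricNames`) for Katib to record it.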
Prometheus Metrics Collector #
yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: prometheus-metrics
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    collector:
      kind: Prometheus
    source:
      httpGet:
        path: /metrics
        port: 8080
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.001"
        max: "0.1"
Custom Metrics Collector #
yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: custom-metrics
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    collector:
      kind: Custom
      # customCollector is a container spec
      customCollector:
        name: custom-metrics-collector
        image: python:3.9
        command:
          - python
          - /scripts/collect_metrics.py
        args:
          - --trial-name=$(TRIAL_NAME)
          - --namespace=$(NAMESPACE)
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.001"
        max: "0.1"
Managing Experiments #
Checking Experiment Status #
bash
# List all experiments
kubectl get experiments -n kubeflow-user-example-com

# Show experiment details
kubectl describe experiment mnist-tuning -n kubeflow-user-example-com

# Show experiment status
kubectl get experiment mnist-tuning -n kubeflow-user-example-com -o jsonpath='{.status}'

# List the experiment's trials
kubectl get trials -n kubeflow-user-example-com -l katib.kubeflow.org/experiment=mnist-tuning
Viewing Trial Results #
bash
# Show trial details
kubectl describe trial mnist-tuning-xxxx -n kubeflow-user-example-com

# Show trial metrics
kubectl get trial mnist-tuning-xxxx -n kubeflow-user-example-com -o jsonpath='{.status.observation.metrics}'

# List Suggestions
kubectl get suggestions -n kubeflow-user-example-com
Stopping and Deleting an Experiment #
bash
# Stop an experiment early: lower maxTrialCount to the number of trials
# already completed (e.g. 3), so no new trials are created
kubectl patch experiment mnist-tuning -n kubeflow-user-example-com --type merge -p '{"spec":{"maxTrialCount":3}}'

# Delete the experiment
kubectl delete experiment mnist-tuning -n kubeflow-user-example-com
Best Practices #
Experiment Design #
text
1. Search space design
   ├── Start wide, then narrow the range
   ├── Search learning rates on a log scale
   └── Set sensible parameter bounds

2. Algorithm choice
   ├── Small budgets: Random Search
   ├── Medium budgets: Bayesian Optimization
   └── Large budgets: Hyperband

3. Parallelism
   ├── Match parallelTrialCount to available resources
   ├── Keep Bayesian optimization parallelism low
   └── Random Search parallelizes well

4. Early stopping
   ├── Use early stopping for long training runs
   ├── Set a sensible start step
   └── Avoid stopping too early
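The log-scale point in item 1 is worth a concrete illustration: a uniform draw from [0.0001, 0.1] almost never lands near the bottom of the range, while sampling the exponent uniformly covers every decade evenly (illustrative comparison):

```python
import math
import random

def log_uniform(lo, hi):
    """Sample so each decade (0.0001-0.001, 0.001-0.01, ...) is equally
    likely, instead of biasing toward the top of the range."""
    return math.exp(random.uniform(math.log(lo), math.log(hi)))

random.seed(0)
uniform_draws = [random.uniform(1e-4, 1e-1) for _ in range(10000)]
log_draws = [log_uniform(1e-4, 1e-1) for _ in range(10000)]

# Fraction of samples below 0.001: ~1% uniform vs ~33% log-uniform
print(sum(x < 1e-3 for x in uniform_draws) / 10000)
print(sum(x < 1e-3 for x in log_draws) / 10000)
```

Katib's `feasibleSpace` has no built-in log scale for `double`, so in practice you either search a log-spaced `discrete` list or tune the exponent as the parameter and exponentiate it in the training script.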
Resource Management #
text
1. Resource configuration
   ├── Set resource requests and limits
   ├── Mind scheduling constraints when using GPUs
   └── Set parallelism to fit the cluster

2. Storage management
   ├── Use persistent storage
   ├── Clean up unneeded artifacts
   └── Share datasets between trials

3. Time management
   ├── Set a sensible maximum trial count
   ├── Use early stopping to save time
   └── Record experiment results
Next Steps #
Now that you have Katib hyperparameter tuning under your belt, move on to Model Serving to learn how to deploy and manage model serving!
Last updated: 2026-04-05