Katib Hyperparameter Tuning #

Overview #

Katib is Kubeflow's automated machine learning (AutoML) component. It provides hyperparameter optimization, neural architecture search (NAS), and model compression.

Core Features #

text
┌─────────────────────────────────────────────────────────────┐
│                     Katib Core Features                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Hyperparameter optimization:                               │
│  ├── Automatic search for optimal hyperparameters           │
│  ├── Multiple optimization algorithms                       │
│  ├── Parallel trial execution                               │
│  └── Early stopping                                         │
│                                                             │
│  Neural architecture search:                                │
│  ├── Automatic network architecture search                  │
│  ├── Multiple NAS algorithms                                │
│  └── Architecture evaluation                                │
│                                                             │
│  Model compression:                                         │
│  ├── Pruning                                                │
│  ├── Quantization                                           │
│  └── Knowledge distillation                                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Core Concepts #

Experiment #

An Experiment defines the overall configuration of an optimization task.

text
Experiment structure:
├── Objective - optimization goal
│   ├── type: maximize/minimize
│   ├── goal: target value
│   └── objectiveMetricName: metric name
│
├── Algorithm - optimization algorithm
│   ├── algorithmName: algorithm type
│   └── algorithmSettings: algorithm parameters
│
├── Parameters - search space
│   ├── parameterType: parameter type
│   └── feasibleSpace: value range
│
└── Trial Template - trial template
    ├── TrialSpec: workload definition
    └── MetricsCollectorSpec: metrics collection

Trial #

A Trial is a single execution of an Experiment with one specific hyperparameter combination.

text
Trial lifecycle:
├── Created - trial created
├── Running - in progress
├── Succeeded - completed successfully
├── Failed - failed
└── EarlyStopped - stopped early

Suggestion #

A Suggestion is a set of hyperparameter values proposed by the optimization algorithm.

text
Suggestion workflow:
├── The Experiment creates a Suggestion
├── The Suggestion invokes the optimization algorithm
├── Hyperparameter suggestions are generated
├── Trials are created and executed
└── Results are collected and optimization continues
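
The workflow above can be sketched as a simple loop. This is a stdlib-only Python illustration with a random Suggestion; the search space, function names, and toy metric are hypothetical stand-ins, not the Katib API:

```python
import random

# Hypothetical stand-ins for the Experiment pieces described above.
SEARCH_SPACE = {"learning_rate": (0.001, 0.1), "batch_size": (16, 128)}
MAX_TRIALS = 12

def suggest(space):
    """A 'random' Suggestion: sample one value per parameter."""
    return {
        "learning_rate": random.uniform(*space["learning_rate"]),
        "batch_size": random.randint(*space["batch_size"]),
    }

def run_trial(params):
    """Stand-in for a Trial; returns the observed objective metric."""
    # Pretend smaller learning rates score slightly better.
    return 0.8 + 0.1 * (1 - params["learning_rate"])

best = None
for _ in range(MAX_TRIALS):
    params = suggest(SEARCH_SPACE)          # Suggestion proposes parameters
    accuracy = run_trial(params)            # a Trial executes with them
    if best is None or accuracy > best[1]:  # results are collected, loop continues
        best = (params, accuracy)

print(f"best accuracy={best[1]:.4f} with {best[0]}")
```

In real Katib the same loop is driven by controllers: the Suggestion CR holds the proposed assignments and each Trial runs as a Kubernetes workload.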

Supported Optimization Algorithms #

Algorithm Overview #

text
┌─────────────────────────────────────────────────────────────┐
│                Algorithms Supported by Katib                │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Black-box optimization:                                    │
│  ├── Random Search                                          │
│  ├── Grid Search                                            │
│  └── Bayesian Optimization                                  │
│                                                             │
│  Multi-fidelity algorithms:                                 │
│  ├── Hyperband                                              │
│  └── Successive Halving                                     │
│                                                             │
│  Neural architecture search:                                │
│  ├── ENAS - Efficient Neural Architecture Search            │
│  └── DARTS - Differentiable Architecture Search             │
│                                                             │
│  Other algorithms:                                          │
│  ├── TPE - Tree-structured Parzen Estimator                 │
│  ├── CMA-ES - Covariance Matrix Adaptation Evolution        │
│  │            Strategy                                      │
│  └── Sobol - Sobol sequences                                │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Random Search #

Random search is the simplest optimization algorithm: it samples hyperparameters at random from the search space.

yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-search
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
  - name: learning_rate
    parameterType: double
    feasibleSpace:
      min: "0.001"
      max: "0.1"
  - name: batch_size
    parameterType: int
    feasibleSpace:
      min: "16"
      max: "128"

Grid Search #

Grid search exhaustively evaluates every combination in the search space.

yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: grid-search
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: grid
  parameters:
  - name: learning_rate
    parameterType: double
    feasibleSpace:
      list:
      - "0.001"
      - "0.01"
      - "0.1"
  - name: batch_size
    parameterType: int
    feasibleSpace:
      list:
      - "16"
      - "32"
      - "64"
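
For the search space above this means 3 learning rates × 3 batch sizes = 9 Trials. The enumeration Katib performs can be pictured with a Cartesian product:

```python
import itertools

# The feasibleSpace lists from the Experiment above.
learning_rates = ["0.001", "0.01", "0.1"]
batch_sizes = ["16", "32", "64"]

# Grid search enumerates the Cartesian product: one Trial per combination.
grid = list(itertools.product(learning_rates, batch_sizes))

print(f"{len(grid)} trials")  # 3 x 3 = 9
for lr, bs in grid:
    print(f"learning_rate={lr} batch_size={bs}")
```

This is also why grid search scales poorly: the trial count grows multiplicatively with each added parameter.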

Bayesian Optimization #

Bayesian optimization models the objective function with a Gaussian process and uses that model to choose promising points.

yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: bayesian-optimization
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: maximize
    goal: 0.98
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: bayesianoptimization
    algorithmSettings:
    - name: random_state
      value: "42"
    - name: n_initial_points
      value: "5"
  parallelTrialCount: 2
  maxTrialCount: 20
  parameters:
  - name: learning_rate
    parameterType: double
    feasibleSpace:
      min: "0.0001"
      max: "0.1"
  - name: hidden_units
    parameterType: int
    feasibleSpace:
      min: "32"
      max: "512"

Hyperband #

Hyperband is an efficient multi-fidelity optimization algorithm: it allocates more of a budget resource (epochs here) to promising configurations.

yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: hyperband
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: hyperband
    algorithmSettings:
    - name: eta
      value: "3"
    - name: resource_name
      value: "epochs"
    - name: resource_type
      value: "int"
  parameters:
  - name: learning_rate
    parameterType: double
    feasibleSpace:
      min: "0.001"
      max: "0.1"
  - name: epochs
    parameterType: int
    feasibleSpace:
      min: "5"
      max: "50"
  trialTemplate:
    primaryContainerName: training
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
            - name: training
              image: python:3.9
              command:
              - python
              - -c
              - |
                import sys
                learning_rate = float(sys.argv[1])
                epochs = int(sys.argv[2])
                print(f"Training with lr={learning_rate}, epochs={epochs}")
                print(f"accuracy={0.8 + 0.1 * (1 - learning_rate)}")
              - "${trialParameters.learning_rate}"
              - "${trialParameters.epochs}"
            restartPolicy: Never
    trialParameters:
    - name: learning_rate
      reference: learning_rate
    - name: epochs
      reference: epochs
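
Hyperband builds on successive halving: with eta: "3", each round keeps roughly the best third of the configurations and triples their budget (the resource named above, epochs). A toy stdlib sketch of one bracket, not Katib's implementation, with an invented objective:

```python
import random

random.seed(42)
ETA = 3            # keep 1/eta of configs each round (algorithmSettings eta)
N_CONFIGS = 9      # configurations in this bracket
MIN_EPOCHS = 1     # initial budget per config (resource_name: epochs)

def train(lr, epochs):
    """Toy objective: more epochs and a moderate lr score higher."""
    return (1 - abs(lr - 0.05)) * (1 - 1 / (1 + epochs))

configs = [random.uniform(0.001, 0.1) for _ in range(N_CONFIGS)]
budget = MIN_EPOCHS
while len(configs) > 1:
    scored = sorted(configs, key=lambda lr: train(lr, budget), reverse=True)
    configs = scored[: max(1, len(scored) // ETA)]  # successive halving step
    budget *= ETA                                   # survivors get eta x budget
    print(f"kept {len(configs)} configs, next budget {budget} epochs")

print(f"winner lr={configs[0]:.4f}")
```

Hyperband itself runs several such brackets with different trade-offs between the number of configurations and the starting budget.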

TPE (Tree-structured Parzen Estimator) #

TPE is an efficient variant of Bayesian optimization that models the densities of good and bad parameter values instead of the objective itself.

yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: tpe-optimization
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: minimize
    objectiveMetricName: loss
  algorithm:
    algorithmName: tpe
    algorithmSettings:
    - name: n_startup_trials
      value: "10"
    - name: n_ei_candidates
      value: "24"
  parallelTrialCount: 2
  maxTrialCount: 30
  parameters:
  - name: lr
    parameterType: double
    feasibleSpace:
      min: "0.0001"
      max: "0.1"
  - name: weight_decay
    parameterType: double
    feasibleSpace:
      min: "0.00001"
      max: "0.01"
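
The idea behind TPE can be sketched in a few lines: split past observations into a "good" and a "bad" group at a quantile of the objective, then pick the candidate (out of n_ei_candidates random draws) with the highest density ratio l(x)/g(x). A deliberately simplified 1-D illustration with a toy loss, not Katib's implementation:

```python
import math
import random

random.seed(0)

def objective(lr):
    """Toy loss surface: minimized near lr = 0.01 (log10(lr) = -2)."""
    return (math.log10(lr) + 2) ** 2

def kde(x, samples, bw=0.5):
    """Crude Gaussian kernel density estimate over log10(lr) values."""
    return sum(math.exp(-((x - s) ** 2) / (2 * bw ** 2)) for s in samples) / len(samples)

# Startup phase: random observations (cf. n_startup_trials above).
obs = [(x, objective(10 ** x)) for x in (random.uniform(-4, -1) for _ in range(10))]

for _ in range(20):
    obs.sort(key=lambda o: o[1])                 # minimize: lower loss is better
    split = max(1, len(obs) // 4)                # top ~25% form the "good" group
    good = [x for x, _ in obs[:split]]
    bad = [x for x, _ in obs[split:]]
    # Score candidates (cf. n_ei_candidates) by the density ratio l(x)/g(x).
    candidates = [random.uniform(-4, -1) for _ in range(24)]
    x = max(candidates, key=lambda c: kde(c, good) / max(kde(c, bad), 1e-12))
    obs.append((x, objective(10 ** x)))

best_x, best_loss = min(obs, key=lambda o: o[1])
print(f"best lr={10 ** best_x:.5f}, loss={best_loss:.5f}")
```

Searching in log10 space here mirrors the best practice of tuning learning rates on a log scale.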

Creating a Hyperparameter Tuning Experiment #

Basic Experiment Configuration #

yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: mnist-tuning
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
    additionalMetricNames:
    - loss
  algorithm:
    algorithmName: random
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
  - name: learning_rate
    parameterType: double
    feasibleSpace:
      min: "0.001"
      max: "0.1"
      step: "0.001"
  - name: batch_size
    parameterType: int
    feasibleSpace:
      min: "16"
      max: "128"
  - name: hidden_units
    parameterType: categorical
    feasibleSpace:
      list:
      - "64"
      - "128"
      - "256"
  trialTemplate:
    primaryContainerName: training
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
            - name: training
              image: tensorflow/tensorflow:2.12.0
              command:
              - python
              - -c
              - |
                import tensorflow as tf
                import numpy as np
                
                learning_rate = float("${trialParameters.learning_rate}")
                batch_size = int("${trialParameters.batch_size}")
                hidden_units = int("${trialParameters.hidden_units}")
                
                print(f"Training with lr={learning_rate}, batch={batch_size}, units={hidden_units}")
                
                mnist = tf.keras.datasets.mnist
                (x_train, y_train), (x_test, y_test) = mnist.load_data()
                x_train, x_test = x_train / 255.0, x_test / 255.0
                
                model = tf.keras.models.Sequential([
                  tf.keras.layers.Flatten(input_shape=(28, 28)),
                  tf.keras.layers.Dense(hidden_units, activation='relu'),
                  tf.keras.layers.Dropout(0.2),
                  tf.keras.layers.Dense(10)
                ])
                
                optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
                loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
                model.compile(optimizer=optimizer, loss=loss_fn, metrics=['accuracy'])
                
                model.fit(x_train, y_train, epochs=5, batch_size=batch_size, verbose=0)
                loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
                
                print(f"loss={loss}")
                print(f"accuracy={accuracy}")
            restartPolicy: Never
    trialParameters:
    - name: learning_rate
      reference: learning_rate
    - name: batch_size
      reference: batch_size
    - name: hidden_units
      reference: hidden_units
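
Before a Trial starts, Katib substitutes each ${trialParameters.<name>} placeholder in the trial template with the value suggested for that Trial. Roughly like this (a simplified illustration, not Katib's actual code):

```python
import re

# A fragment of a trial template with placeholders, as in the spec above.
template = """
command:
- python
- train.py
- --lr=${trialParameters.learning_rate}
- --batch-size=${trialParameters.batch_size}
"""

def render(template, assignments):
    """Replace ${trialParameters.<name>} with the suggested values."""
    return re.sub(
        r"\$\{trialParameters\.(\w+)\}",
        lambda m: str(assignments[m.group(1)]),
        template,
    )

spec = render(template, {"learning_rate": 0.01, "batch_size": 32})
print(spec)
```

This is why the inline Python script above can read the values with float("${trialParameters.learning_rate}"): by the time the container runs, the placeholder is already a literal number.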

Using a TFJob as the Trial #

yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: tfjob-tuning
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: maximize
    goal: 0.98
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: bayesianoptimization
  parallelTrialCount: 2
  maxTrialCount: 10
  parameters:
  - name: learning_rate
    parameterType: double
    feasibleSpace:
      min: "0.0001"
      max: "0.01"
  trialTemplate:
    primaryContainerName: tensorflow
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: TFJob
      spec:
        tfReplicaSpecs:
          Chief:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                - name: tensorflow
                  image: tensorflow/tensorflow:2.12.0
                  command:
                  - python
                  - /opt/model/train.py
                  - --learning-rate=${trialParameters.learning_rate}
                  resources:
                    limits:
                      nvidia.com/gpu: 1
    trialParameters:
    - name: learning_rate
      reference: learning_rate

Using a PyTorchJob as the Trial #

yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: pytorchjob-tuning
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: minimize
    objectiveMetricName: loss
  algorithm:
    algorithmName: hyperband
  parameters:
  - name: lr
    parameterType: double
    feasibleSpace:
      min: "0.001"
      max: "0.1"
  trialTemplate:
    primaryContainerName: pytorch
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: PyTorchJob
      spec:
        pytorchReplicaSpecs:
          Master:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                - name: pytorch
                  image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
                  command:
                  - python
                  - /opt/model/train.py
                  - --lr=${trialParameters.lr}
    trialParameters:
    - name: lr
      reference: lr

Parameter Types #

Numeric Parameters #

yaml
parameters:
# Floating-point parameter
- name: learning_rate
  parameterType: double
  feasibleSpace:
    min: "0.0001"
    max: "0.1"
    step: "0.0001"  # optional step size

# Integer parameter
- name: batch_size
  parameterType: int
  feasibleSpace:
    min: "16"
    max: "256"
    step: "16"  # optional step size
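
A step turns a numeric range into a finite set of values; for the batch_size range above that is 16, 32, …, 256. A quick sketch of the enumeration (the helper name is illustrative):

```python
def enumerate_space(min_v, max_v, step):
    """List the values of an int feasibleSpace with a step (inclusive of max)."""
    values = []
    v = min_v
    while v <= max_v:
        values.append(v)
        v += step
    return values

batch_sizes = enumerate_space(16, 256, 16)
print(batch_sizes)  # 16 values: 16, 32, ..., 256
```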

Categorical Parameters #

yaml
parameters:
# Categorical parameter
- name: optimizer
  parameterType: categorical
  feasibleSpace:
    list:
    - "adam"
    - "sgd"
    - "rmsprop"

# Discrete parameter
- name: hidden_units
  parameterType: discrete
  feasibleSpace:
    list:
    - "64"
    - "128"
    - "256"
    - "512"

Early Stopping #

Configuring Early Stopping #

yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: early-stopping-example
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  earlyStopping:
    algorithmName: medianstop
    algorithmSettings:
    - name: min_trials_required
      value: "3"
    - name: start_step
      value: "5"
  parallelTrialCount: 3
  maxTrialCount: 20
  parameters:
  - name: learning_rate
    parameterType: double
    feasibleSpace:
      min: "0.001"
      max: "0.1"

Supported Early Stopping Algorithms #

text
Early stopping algorithms:
└── medianstop - Median Stopping Rule (the early stopping algorithm currently supported by Katib)
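
The medianstop rule works roughly as follows: once a Trial has passed start_step, it is stopped if its best metric so far falls below the median of what other trials achieved by the same step. A simplified sketch (assumes a maximize objective; the function name is illustrative):

```python
import statistics

def should_stop(current_history, other_histories, start_step, min_trials_required):
    """Median stopping rule (simplified): compare the running trial's best
    metric so far against the median best of other trials at the same step."""
    step = len(current_history)
    if step < start_step or len(other_histories) < min_trials_required:
        return False
    peers = [max(h[:step]) for h in other_histories if len(h) >= step]
    if len(peers) < min_trials_required:
        return False
    return max(current_history) < statistics.median(peers)

# Accuracy histories of three completed trials, one value per step.
completed = [
    [0.5, 0.7, 0.8, 0.85],
    [0.4, 0.6, 0.75, 0.8],
    [0.45, 0.65, 0.7, 0.78],
]
print(should_stop([0.2, 0.25, 0.3], completed, start_step=3, min_trials_required=3))   # True
print(should_stop([0.5, 0.7, 0.82], completed, start_step=3, min_trials_required=3))  # False
```

The start_step and min_trials_required settings above play exactly these roles: they prevent stopping before enough steps and enough peer trials exist.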

Metrics Collection #

File Metrics Collector #

yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: file-metrics
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /output/metrics.txt
        kind: File
        format: TEXT
    collector:
      kind: File
  parameters:
  - name: lr
    parameterType: double
    feasibleSpace:
      min: "0.001"
      max: "0.1"
  trialTemplate:
    primaryContainerName: training
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
            - name: training
              image: python:3.9
              command:
              - sh
              - -c
              - |
                mkdir -p /output
                python train.py --lr ${trialParameters.lr}
                echo "accuracy=0.95" > /output/metrics.txt
            restartPolicy: Never
    trialParameters:
    - name: lr
      reference: lr
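
With the text format, the collector reads lines of the form <metric-name>=<value> from the file (the echo above emits exactly that). A simplified parser illustrating the idea, not Katib's implementation:

```python
def parse_metrics(text):
    """Parse 'name=value' lines (the File collector's text format, simplified)."""
    metrics = {}
    for line in text.splitlines():
        if "=" in line:
            name, _, value = line.partition("=")
            try:
                metrics[name.strip()] = float(value)
            except ValueError:
                continue  # skip lines whose value is not numeric
    return metrics

content = "epoch=5\naccuracy=0.95\nloss=0.12\n"
print(parse_metrics(content))
```

The metric names on the left-hand side must match objectiveMetricName (and any additionalMetricNames) for Katib to record them.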

Prometheus Metrics Collector #

yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: prometheus-metrics
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    source:
      httpGet:
        path: /metrics
        port: 8080
    collector:
      kind: PrometheusMetric
  parameters:
  - name: lr
    parameterType: double
    feasibleSpace:
      min: "0.001"
      max: "0.1"

Custom Metrics Collector #

yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: custom-metrics
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    collector:
      kind: Custom
      customCollector:
        image: python:3.9
        command:
        - python
        - /scripts/collect_metrics.py
        args:
        - --trial-name=$(TRIAL_NAME)
        - --namespace=$(NAMESPACE)
  parameters:
  - name: lr
    parameterType: double
    feasibleSpace:
      min: "0.001"
      max: "0.1"

Managing Experiments #

Checking Experiment Status #

bash
# List all experiments
kubectl get experiments -n kubeflow-user-example-com

# Show experiment details
kubectl describe experiment mnist-tuning -n kubeflow-user-example-com

# Show experiment status
kubectl get experiment mnist-tuning -n kubeflow-user-example-com -o jsonpath='{.status}'

# List the experiment's trials
kubectl get trials -n kubeflow-user-example-com -l katib.kubeflow.org/experiment=mnist-tuning

Viewing Trial Results #

bash
# Show trial details
kubectl describe trial mnist-tuning-xxxx -n kubeflow-user-example-com

# Show trial metrics
kubectl get trial mnist-tuning-xxxx -n kubeflow-user-example-com -o jsonpath='{.status.observation.metrics}'

# List Suggestions
kubectl get suggestions -n kubeflow-user-example-com

Stopping and Deleting Experiments #

bash
# Stop an experiment early by lowering maxTrialCount to the number of trials already run (6 here as an example)
kubectl patch experiment mnist-tuning -n kubeflow-user-example-com --type merge -p '{"spec":{"maxTrialCount":6}}'

# Delete the experiment
kubectl delete experiment mnist-tuning -n kubeflow-user-example-com

Best Practices #

Experiment Design #

text
1. Search space design
   ├── Start with a broad range, then narrow it
   ├── Search learning rates on a log scale
   └── Set sensible parameter bounds

2. Algorithm choice
   ├── Small budgets: Random Search
   ├── Medium budgets: Bayesian Optimization
   └── Large budgets: Hyperband

3. Parallelism
   ├── Set parallelTrialCount to match available resources
   ├── Bayesian optimization works best with limited parallelism
   └── Random Search parallelizes well

4. Early stopping
   ├── Use early stopping for long training runs
   ├── Choose a sensible start step
   └── Avoid stopping too early

Resource Management #

text
1. Resource configuration
   ├── Set resource requests and limits
   ├── Mind scheduling when using GPUs
   └── Choose a sensible degree of parallelism

2. Storage management
   ├── Use persistent storage
   ├── Clean up unneeded artifacts
   └── Share datasets across trials

3. Time management
   ├── Set a reasonable maximum trial count
   ├── Use early stopping to save time
   └── Record experiment results

Next Steps #

Now that you know how to tune hyperparameters with Katib, continue to the model serving section to learn how to deploy and manage model services!

Last updated: 2026-04-05