Model Serving #

Overview #

Model serving deploys a trained machine learning model as an accessible API service so that it can handle real inference requests. Kubeflow supports several model serving options.

Supported Serving Frameworks #

text
Model serving frameworks:
├── KServe (recommended)
│   ├── Kubernetes-native model serving
│   ├── Support for many model frameworks
│   ├── Autoscaling
│   └── Advanced inference features
│
├── Seldon Core
│   ├── MLOps platform
│   ├── Flexible inference graphs
│   ├── Explainer support
│   └── A/B testing
│
├── TFServing
│   ├── Official TensorFlow serving
│   ├── High-performance inference
│   └── Hot model reloading
│
└── Triton Inference Server
    ├── NVIDIA inference server
    ├── Multi-framework support
    └── GPU optimization

KServe #

KServe Overview #

KServe (formerly KFServing) is a Kubernetes-native model serving solution that provides high-performance, scalable model inference.

text
KServe core features:
├── Multi-framework support
│   ├── TensorFlow
│   ├── PyTorch
│   ├── scikit-learn
│   ├── XGBoost
│   ├── ONNX
│   └── Custom
│
├── Autoscaling
│   ├── Knative Serving
│   ├── Request-based autoscaling
│   └── Scale to zero
│
├── Advanced features
│   ├── Canary rollouts
│   ├── A/B testing
│   ├── Model explanation
│   └── Transformers
│
└── Protocol support
    ├── V1 (REST)
    ├── V2 (REST/gRPC)
    └── Custom protocols

Installing KServe #

bash
# Install KServe
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.11.0/kserve.yaml

# Install the KServe cluster resources (serving runtimes)
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.11.0/kserve-cluster-resources.yaml

# Verify the installation
kubectl get pods -n kserve

Deploying an InferenceService #

TensorFlow Model #

yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: tensorflow-model
  namespace: kubeflow-user-example-com
spec:
  predictor:
    tensorflow:
      storageUri: gs://kfserving-examples/models/tensorflow/iris
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "2"
          memory: "4Gi"

PyTorch Model #

yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: pytorch-model
  namespace: kubeflow-user-example-com
spec:
  predictor:
    pytorch:
      storageUri: gs://kfserving-examples/models/pytorch/iris
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"

scikit-learn Model #

yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-model
  namespace: kubeflow-user-example-com
spec:
  predictor:
    sklearn:
      storageUri: gs://kfserving-examples/models/sklearn/iris
      protocolVersion: v2

XGBoost Model #

yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: xgboost-model
  namespace: kubeflow-user-example-com
spec:
  predictor:
    xgboost:
      storageUri: gs://kfserving-examples/models/xgboost/iris
      protocolVersion: v2

ONNX Model #

yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: onnx-model
  namespace: kubeflow-user-example-com
spec:
  predictor:
    onnx:
      storageUri: gs://kfserving-examples/models/onnx/iris

Custom Model #

yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: custom-model
  namespace: kubeflow-user-example-com
spec:
  predictor:
    containers:
    - name: custom-container
      image: my-registry/custom-model:latest
      ports:
      - containerPort: 8080
        protocol: TCP
      env:
      - name: MODEL_PATH
        value: /models/model.pkl
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"

GPU Support #

yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: gpu-model
  namespace: kubeflow-user-example-com
spec:
  predictor:
    tensorflow:
      storageUri: gs://my-bucket/models/tensorflow
      resources:
        requests:
          cpu: "2"
          memory: "8Gi"
          nvidia.com/gpu: "1"
        limits:
          nvidia.com/gpu: "1"

Inference Protocols #

V1 Protocol (REST) #

python
import requests

url = "http://sklearn-model.kubeflow-user-example-com.example.com/v1/models/sklearn-model:predict"

data = {
    "instances": [
        [5.1, 3.5, 1.4, 0.2],
        [6.2, 2.9, 4.3, 1.3]
    ]
}

response = requests.post(url, json=data)
predictions = response.json()
print(predictions)

V2 Protocol (REST) #

python
import requests

url = "http://sklearn-model.kubeflow-user-example-com.example.com/v2/models/sklearn-model/infer"

data = {
    "id": "request-1",
    "inputs": [
        {
            "name": "input-0",
            "shape": [2, 4],
            "datatype": "FP32",
            # Tensor contents are sent flattened, in row-major order
            "data": [5.1, 3.5, 1.4, 0.2, 6.2, 2.9, 4.3, 1.3]
        }
    ]
}

response = requests.post(url, json=data)
result = response.json()
print(result)

gRPC Calls #

For example, a TensorFlow predictor exposes the TensorFlow Serving gRPC predict API:

python
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel('tensorflow-model.kubeflow-user-example-com.example.com:9000')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = 'tensorflow-model'
request.inputs['input-0'].CopyFrom(
    tf.make_tensor_proto(
        values=[[5.1, 3.5, 1.4, 0.2]],
        dtype=tf.float32,
        shape=[1, 4]
    )
)

response = stub.Predict(request)
print(response)

Advanced Features #

Canary Rollouts #

KServe splits traffic between the latest revision and the previously deployed one: when you update an InferenceService with canaryTrafficPercent set, only that share of requests reaches the new model.

yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: canary-model
  namespace: kubeflow-user-example-com
spec:
  predictor:
    # 20% of traffic goes to the latest revision (v2); the previously
    # deployed revision (v1) keeps the remaining 80%
    canaryTrafficPercent: 20
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://my-bucket/models/sklearn/v2
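
You can verify the split from the InferenceService status (a sketch using the kserve Python SDK; the exact status layout may differ slightly across KServe versions):

python
from kserve import KServeClient

isvc = KServeClient().get("canary-model",
                          namespace="kubeflow-user-example-com")
# The predictor component status lists each revision's traffic percentage
print(isvc["status"]["components"]["predictor"].get("traffic"))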

A/B Testing #

yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: ab-model
  namespace: kubeflow-user-example-com
spec:
  predictor:
    sklearn:
      storageUri: gs://my-bucket/models/sklearn/v1
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
---
apiVersion: serving.kserve.io/v1alpha1
kind: InferenceGraph
metadata:
  name: ab-test-graph
  namespace: kubeflow-user-example-com
spec:
  nodes:
    root:
      routerType: Splitter
      steps:
      - serviceName: ab-model
        weight: 70
      - serviceName: ab-model-v2
        weight: 30
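
Clients call the InferenceGraph endpoint itself; the Splitter forwards each request to one of the steps according to the weights. A quick sketch (the hostname assumes the usual Knative domain pattern):

python
import requests

# Hypothetical external hostname of the InferenceGraph
url = "http://ab-test-graph.kubeflow-user-example-com.example.com"

data = {"instances": [[5.1, 3.5, 1.4, 0.2]]}

# Roughly 70% of these requests should land on ab-model, 30% on ab-model-v2
for _ in range(10):
    response = requests.post(url, json=data)
    print(response.json())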

Transformer #

yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: transformer-model
  namespace: kubeflow-user-example-com
spec:
  predictor:
    sklearn:
      storageUri: gs://my-bucket/models/sklearn/iris
  transformer:
    containers:
    - name: transformer
      image: my-registry/transformer:latest
      command:
      - python
      - -m
      - transformer
      - --model_name
      - transformer-model
      - --protocol
      - v2
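
One way to implement the transformer image referenced above is with the kserve SDK's Model class. A minimal sketch (the feature scaling is a made-up example; real preprocessing depends on your model):

python
import argparse

from kserve import Model, ModelServer

class IrisTransformer(Model):
    def __init__(self, name: str, predictor_host: str):
        super().__init__(name)
        # Output of preprocess() is forwarded to this predictor
        self.predictor_host = predictor_host
        self.ready = True

    def preprocess(self, inputs, headers=None):
        # Illustrative only: scale raw features before the predictor sees them
        scaled = [[x / 10.0 for x in row] for row in inputs["instances"]]
        return {"instances": scaled}

    def postprocess(self, response, headers=None):
        # Pass predictor output through unchanged
        return response

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name", default="transformer-model")
    # --predictor_host is typically injected by the KServe controller
    parser.add_argument("--predictor_host", default=None)
    args, _ = parser.parse_known_args()

    ModelServer().start([IrisTransformer(args.model_name, args.predictor_host)])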

Model Explanation #

yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: explainable-model
  namespace: kubeflow-user-example-com
spec:
  predictor:
    sklearn:
      storageUri: gs://my-bucket/models/sklearn/iris
  explainer:
    alibi:
      type: AnchorTabular
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"

Seldon Core #

Seldon Core Overview #

Seldon Core is an open-source MLOps platform that provides flexible model deployment and management capabilities.

text
Seldon Core features:
├── Inference graphs
│   ├── Model composition
│   ├── Data preprocessing
│   └── Postprocessing
│
├── Model explanation
│   ├── Alibi Explain
│   ├── Multiple explanation methods
│   └── Visualization
│
├── Monitoring
│   ├── Prometheus metrics
│   ├── Grafana dashboards
│   └── Request logging
│
└── Advanced deployment
    ├── A/B testing
    ├── Canary rollouts
    └── Multi-armed bandits

Installing Seldon Core #

bash
# Install Seldon Core
kubectl create namespace seldon-system
helm install seldon-core seldon-core-operator \
  --repo https://storage.googleapis.com/seldon-charts \
  --namespace seldon-system \
  --set usageMetrics.enabled=true

# Install Istio resources (used for ingress)
kubectl apply -f https://raw.githubusercontent.com/SeldonIO/seldon-core/master/examples/notebooks/seldon_core_istio.yaml

Deploying a SeldonDeployment #

yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: sklearn-model
  namespace: kubeflow-user-example-com
spec:
  predictors:
  - name: default
    replicas: 1
    graph:
      name: classifier
      implementation: SKLEARN_SERVER
      modelUri: gs://seldon-models/sklearn/iris
      envSecretRefName: seldon-rclone-secret
    componentSpecs:
    - spec:
        containers:
        - name: classifier
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"

Inference Graph #

yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: inference-graph
  namespace: kubeflow-user-example-com
spec:
  predictors:
  - name: default
    graph:
      # TRANSFORMER and COMBINER nodes are backed by custom containers
      # declared under componentSpecs (omitted here for brevity)
      name: preprocess
      type: TRANSFORMER
      modelUri: gs://my-bucket/preprocess
      children:
      - name: ensemble
        type: COMBINER
        children:
        - name: model-a
          type: MODEL
          implementation: SKLEARN_SERVER
          modelUri: gs://my-bucket/model-a
        - name: model-b
          type: MODEL
          implementation: SKLEARN_SERVER
          modelUri: gs://my-bucket/model-b

Model Storage #

Storage Options #

text
Supported storage:
├── S3 / MinIO
├── Google Cloud Storage (GCS)
├── Azure Blob Storage
├── PVC (Persistent Volume Claim)
├── HTTP/HTTPS
└── Local storage (testing)

S3 Storage #

yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: s3-model
  namespace: kubeflow-user-example-com
spec:
  predictor:
    # s3-sa is a service account bound to a secret holding the S3
    # credentials (the secret carries serving.kserve.io/s3-* annotations)
    serviceAccountName: s3-sa
    sklearn:
      storageUri: s3://my-bucket/models/sklearn/iris

PVC Storage #

yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
  namespace: kubeflow-user-example-com
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: pvc-model
  namespace: kubeflow-user-example-com
spec:
  predictor:
    sklearn:
      storageUri: pvc://model-pvc/models/sklearn

Autoscaling #

Knative Configuration #

yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: autoscale-model
  namespace: kubeflow-user-example-com
  annotations:
    autoscaling.knative.dev/target: "10"
    autoscaling.knative.dev/minScale: "1"
    autoscaling.knative.dev/maxScale: "10"
spec:
  predictor:
    sklearn:
      storageUri: gs://my-bucket/models/sklearn/iris
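
The autoscaling.knative.dev/target annotation sets the per-pod concurrency the Knative autoscaler aims for, so sustained concurrent load above that value triggers scale-out. A quick sketch to generate such load (hostname is a placeholder; watch the pods with kubectl get pods -w):

python
import requests
from concurrent.futures import ThreadPoolExecutor

url = "http://autoscale-model.kubeflow-user-example-com.example.com/v1/models/autoscale-model:predict"
data = {"instances": [[5.1, 3.5, 1.4, 0.2]]}

def send(_):
    return requests.post(url, json=data).status_code

# 30 concurrent requests against a target of 10 should make Knative add pods
with ThreadPoolExecutor(max_workers=30) as pool:
    print(list(pool.map(send, range(100))))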

HPA Configuration #

When KServe runs in raw deployment mode (without Knative), scaling is handled by a standard HorizontalPodAutoscaler instead:

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-hpa
  namespace: kubeflow-user-example-com
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sklearn-model-predictor
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

Monitoring and Logging #

Prometheus Metrics #

yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kserve-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: kserve
  endpoints:
  - port: metrics
    interval: 30s

Request Logging #

yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: logged-model
  namespace: kubeflow-user-example-com
spec:
  predictor:
    logger:
      mode: all
      url: http://log-server/logging
    sklearn:
      storageUri: gs://my-bucket/models/sklearn/iris
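
The logger POSTs request and/or response payloads to the configured URL as CloudEvents. A minimal sketch of a receiver that just prints whatever arrives (a stand-in for the log-server above):

python
from http.server import BaseHTTPRequestHandler, HTTPServer

class LogHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        # CloudEvent metadata (event type, source InferenceService, ...)
        # arrives in ce-* headers
        print(dict(self.headers), body.decode("utf-8", errors="replace"))
        self.send_response(200)
        self.end_headers()

HTTPServer(("", 8080), LogHandler).serve_forever()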

Managing Services #

Checking Service Status #

bash
# List InferenceServices
kubectl get inferenceservices -n kubeflow-user-example-com

# Show service details
kubectl describe inferenceservice sklearn-model -n kubeflow-user-example-com

# Check pod status
kubectl get pods -n kubeflow-user-example-com -l serving.kserve.io/inferenceservice=sklearn-model

# View logs
kubectl logs -n kubeflow-user-example-com -l serving.kserve.io/inferenceservice=sklearn-model

Updating a Service #

bash
# Update the model storage URI
kubectl patch inferenceservice sklearn-model -n kubeflow-user-example-com --type merge -p '
{
  "spec": {
    "predictor": {
      "sklearn": {
        "storageUri": "gs://my-bucket/models/sklearn/v2"
      }
    }
  }
}'
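
The same update can be issued from Python with the kserve SDK client (a sketch, assuming the kserve package is installed; the patch body mirrors the kubectl example above):

python
from kserve import KServeClient

client = KServeClient()
client.patch(
    "sklearn-model",
    {"spec": {"predictor": {"sklearn": {
        "storageUri": "gs://my-bucket/models/sklearn/v2"}}}},
    namespace="kubeflow-user-example-com",
)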

Deleting a Service #

bash
# Delete the InferenceService
kubectl delete inferenceservice sklearn-model -n kubeflow-user-example-com

Best Practices #

Deployment Strategy #

text
1. Version management
   ├── Use semantic versioning
   ├── Keep historical versions
   └── Support fast rollback

2. Release strategy
   ├── Canary rollouts
   ├── A/B testing
   └── Blue-green deployment

3. Resource configuration
   ├── Set sensible resource requests
   ├── Configure autoscaling
   └── Monitor resource usage

Performance Optimization #

text
1. Model optimization
   ├── Quantization
   ├── Pruning
   └── Knowledge distillation

2. Inference optimization
   ├── Batching
   ├── Caching
   └── Asynchronous processing

3. Resource optimization
   ├── GPU acceleration
   ├── Multi-model serving
   └── Resource sharing

Next Steps #

Now that you have mastered model serving, continue with the Training Overview to learn how Kubeflow manages training jobs!

Last updated: 2026-04-05