Monitoring and Logging #

Overview #

Monitoring and logging are essential for keeping a Kubeflow platform running reliably. This chapter covers how to set up and use Kubeflow's monitoring stack.

Monitoring Scope #

text
┌─────────────────────────────────────────────────────────────┐
│                     Monitoring Scope                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Infrastructure monitoring:                                 │
│  ├── Node resource usage                                    │
│  ├── Pod status                                             │
│  ├── Network traffic                                        │
│  └── Storage usage                                          │
│                                                             │
│  Application monitoring:                                    │
│  ├── Pipeline run status                                    │
│  ├── Training job progress                                  │
│  ├── Model serving performance                              │
│  └── Notebook usage                                         │
│                                                             │
│  Log management:                                            │
│  ├── Centralized log collection                             │
│  ├── Log querying and analysis                              │
│  ├── Log-based alerting                                     │
│  └── Log archiving                                          │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Prometheus Monitoring #

Installing Prometheus #

bash
# Install the kube-prometheus-stack chart
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace

Configuring a ServiceMonitor #

yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubeflow-monitor
  namespace: monitoring
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: kubeflow
  namespaceSelector:
    matchNames:
    - kubeflow
    - istio-system
  endpoints:
  - port: metrics
    interval: 30s

Pipeline Metrics #

yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ml-pipeline
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: ml-pipeline
  namespaceSelector:
    matchNames:
    - kubeflow
  endpoints:
  - port: metrics
    path: /metrics
    interval: 30s

Training Job Metrics #

yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: training-jobs
  namespace: monitoring
spec:
  selector:
    matchLabels:
      kubeflow.org/job-type: training
  namespaceSelector:
    any: true
  podMetricsEndpoints:
  - port: metrics
    interval: 15s

Grafana Dashboards #

Installing Grafana #

bash
# Grafana is normally installed together with kube-prometheus-stack.
# To install it standalone instead (the grafana Helm repo must be added first):
helm install grafana grafana/grafana --namespace monitoring

Kubeflow Dashboard Configuration #

yaml
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: kubeflow-overview
  namespace: monitoring
  labels:
    app: grafana
spec:
  json: |
    {
      "dashboard": {
        "title": "Kubeflow Overview",
        "panels": [
          {
            "title": "Pipeline Runs",
            "type": "graph",
            "targets": [
              {
                "expr": "rate(kubeflow_pipeline_runs_total[5m])",
                "legendFormat": "{{status}}"
              }
            ]
          },
          {
            "title": "Training Jobs",
            "type": "graph",
            "targets": [
              {
                "expr": "kubeflow_training_jobs_active",
                "legendFormat": "{{job_type}}"
              }
            ]
          },
          {
            "title": "GPU Utilization",
            "type": "graph",
            "targets": [
              {
                "expr": "DCGM_FI_DEV_GPU_UTIL",
                "legendFormat": "{{gpu}}"
              }
            ]
          }
        ]
      }
    }
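
The `rate(kubeflow_pipeline_runs_total[5m])` query in the panel above returns the per-second increase of the counter over a 5-minute window. As a rough mental model, here is a simplified sketch of that calculation; it ignores the counter-reset and extrapolation handling that real Prometheus performs:

```python
# Simplified illustration of PromQL rate() over a counter:
# rate(metric[5m]) ≈ (last sample - first sample) / window length.

def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value) inside the window."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 == t0:
        return 0.0
    return (v1 - v0) / (t1 - t0)

# Counter went from 120 to 150 pipeline runs over a 300 s window:
samples = [(0, 120), (60, 126), (300, 150)]
print(simple_rate(samples))  # 0.1 runs per second
```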

Custom Dashboards #

json
{
  "dashboard": {
    "title": "ML Training Dashboard",
    "panels": [
      {
        "title": "Training Progress",
        "type": "graph",
        "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
        "targets": [
          {
            "expr": "training_loss",
            "legendFormat": "Loss"
          },
          {
            "expr": "training_accuracy",
            "legendFormat": "Accuracy"
          }
        ]
      },
      {
        "title": "GPU Memory",
        "type": "gauge",
        "gridPos": {"x": 12, "y": 0, "w": 12, "h": 8},
        "targets": [
          {
            "expr": "DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL * 100",
            "legendFormat": "Memory Usage %"
          }
        ]
      }
    ]
  }
}
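
Before loading a hand-written dashboard like the one above, it can be worth a quick sanity check that the JSON parses and every panel target carries a PromQL `expr`. A hypothetical check, not part of Grafana itself:

```python
import json

# Minimal pre-commit check for a dashboard definition: parse the JSON and
# confirm every panel target has a PromQL "expr" field.
dashboard_json = """
{
  "dashboard": {
    "title": "ML Training Dashboard",
    "panels": [
      {"title": "Training Progress",
       "targets": [{"expr": "training_loss", "legendFormat": "Loss"}]}
    ]
  }
}
"""

dashboard = json.loads(dashboard_json)
for panel in dashboard["dashboard"]["panels"]:
    for target in panel.get("targets", []):
        assert "expr" in target, f"panel {panel['title']!r} has a target without expr"
print("dashboard JSON OK")
```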

Log Management #

Installing Loki #

bash
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack --namespace monitoring

Configuring Promtail #

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: monitoring
data:
  promtail.yaml: |
    server:
      http_listen_port: 9080
    positions:
      filename: /tmp/positions.yaml
    clients:
      - url: http://loki:3100/loki/api/v1/push
    scrape_configs:
    - job_name: kubeflow-pods
      kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
          - kubeflow
          - istio-system
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod

Querying Logs #

bash
# Query logs with LogCLI
logcli query '{namespace="kubeflow"}' --limit=100

# Filter logs for a specific application
logcli query '{app="ml-pipeline"} |= "error"'

# Search training logs
logcli query '{namespace="kubeflow-user-alice"} |~ "training"'
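Under the hood, `logcli` talks to Loki's HTTP API; a LogQL string is passed as the `query` parameter to the `/loki/api/v1/query_range` endpoint. A sketch of building such a request URL (host and port are assumptions matching the in-cluster service):

```python
from urllib.parse import urlencode

# Build the URL for Loki's query_range endpoint; the LogQL string goes in
# the "query" parameter, URL-encoded.
def loki_query_url(logql, limit=100, base="http://loki:3100"):
    params = urlencode({"query": logql, "limit": limit})
    return f"{base}/loki/api/v1/query_range?{params}"

url = loki_query_url('{app="ml-pipeline"} |= "error"')
print(url)
```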

Viewing Logs in Grafana #

text
1. Add Loki as a data source
   ├── URL: http://loki:3100
   └── Access mode: Server

2. Create a log panel
   ├── Open Explore
   ├── Select the Loki data source
   └── Enter a LogQL query

3. Common queries
   ├── {namespace="kubeflow"} |= "error"
   ├── {app="notebook"} | json | line_format "{{.message}}"
   └── {pod=~"tensorflow-job-.*"} |~ "epoch"

Alerting Configuration #

Prometheus Alert Rules #

yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubeflow-alerts
  namespace: monitoring
spec:
  groups:
  - name: kubeflow.rules
    rules:
    - alert: PipelineRunFailed
      expr: kubeflow_pipeline_run_status{status="failed"} > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pipeline run failed"
        description: "Pipeline {{ $labels.pipeline }} failed"
    
    - alert: TrainingJobStuck
      expr: kubeflow_training_job_duration_seconds > 86400
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Training job running too long"
        description: "Training job {{ $labels.job_name }} has been running for over 24 hours"
    
    - alert: GPUUtilizationLow
      expr: DCGM_FI_DEV_GPU_UTIL < 10
      for: 30m
      labels:
        severity: info
      annotations:
        summary: "GPU utilization is low"
        description: "GPU {{ $labels.gpu }} utilization is below 10%"
    
    - alert: NotebookPodHighMemory
      expr: container_memory_usage_bytes{pod=~"jupyter-.*"} / container_spec_memory_limit_bytes{pod=~"jupyter-.*"} > 0.9
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Notebook memory usage high"
        description: "Notebook {{ $labels.pod }} is using over 90% of memory limit"
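
The `NotebookPodHighMemory` rule fires when a notebook pod uses more than 90% of its memory limit. The threshold logic can be mirrored locally, which is handy for unit-testing the boundary before encoding it in PromQL (the values below are illustrative):

```python
# Local mirror of the NotebookPodHighMemory condition:
# usage / limit > 0.9 triggers the alert.
def memory_alert(usage_bytes, limit_bytes, threshold=0.9):
    if limit_bytes == 0:
        return False  # no limit set: the PromQL expression yields no sample
    return usage_bytes / limit_bytes > threshold

print(memory_alert(usage_bytes=15.5 * 2**30, limit_bytes=16 * 2**30))  # True (~97%)
print(memory_alert(usage_bytes=8 * 2**30, limit_bytes=16 * 2**30))     # False (50%)
```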

Alertmanager Configuration #

yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alerts@example.com'
      smtp_auth_username: 'alerts@example.com'
      smtp_auth_password: 'password'
    
    route:
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'default'
      routes:
      - match:
          severity: critical
        receiver: 'critical'
      - match:
          severity: warning
        receiver: 'warning'
    
    receivers:
    - name: 'default'
      email_configs:
      - to: 'team@example.com'
    
    - name: 'critical'
      email_configs:
      - to: 'oncall@example.com'
      slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts-critical'
    
    - name: 'warning'
      slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts-warning'
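
The route tree above dispatches each alert to the first child route whose matchers all equal the alert's labels; if none match, the top-level receiver applies. A minimal sketch of that dispatch logic (a simplification: real Alertmanager also supports `continue`, regex matchers, and nested routes):

```python
# Matchers mirroring the routes in the Alertmanager config above.
ROUTES = [
    ({"severity": "critical"}, "critical"),
    ({"severity": "warning"}, "warning"),
]
DEFAULT_RECEIVER = "default"

def pick_receiver(labels):
    """Return the receiver name for an alert's label set."""
    for match, receiver in ROUTES:
        if all(labels.get(k) == v for k, v in match.items()):
            return receiver
    return DEFAULT_RECEIVER

print(pick_receiver({"severity": "critical", "namespace": "kubeflow"}))  # critical
print(pick_receiver({"severity": "info"}))                               # default
```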

Training Monitoring #

Custom Training Metrics #

python
import tensorflow as tf
from prometheus_client import Counter, Gauge, Summary, start_http_server

training_loss = Gauge('training_loss', 'Current training loss')
training_accuracy = Gauge('training_accuracy', 'Current training accuracy')
training_steps = Counter('training_steps_total', 'Total training steps')
training_duration = Summary('training_duration_seconds', 'Time spent training')

# Expose metrics on :8000 for Prometheus to scrape
start_http_server(8000)

class MetricsCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        training_loss.set(logs.get('loss', 0.0))
        # 'accuracy' is only present if it was compiled as a metric
        training_accuracy.set(logs.get('accuracy', 0.0))
        training_steps.inc()

@training_duration.time()
def train_model():
    model.fit(x_train, y_train, callbacks=[MetricsCallback()])
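
What Prometheus sees when it scrapes the port opened by `start_http_server` is the plain-text exposition format: `HELP`/`TYPE` metadata lines followed by one `name value` sample per line. A stdlib-only sketch of that output (in practice `prometheus_client` renders this for you):

```python
# Render one gauge in Prometheus text exposition format.
def exposition(name, help_text, metric_type, value):
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} {metric_type}\n"
        f"{name} {value}\n"
    )

print(exposition("training_loss", "Current training loss", "gauge", 0.42))
```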

PyTorch Training Metrics #

python
import torch
from prometheus_client import Gauge, Counter, start_http_server
import os

# In distributed training, export metrics from rank 0 only so that
# workers don't all try to open the same port
rank = int(os.environ.get('RANK', 0))

if rank == 0:
    start_http_server(8000)

training_loss = Gauge('training_loss', 'Training loss')
training_accuracy = Gauge('training_accuracy', 'Training accuracy')
epochs_completed = Counter('epochs_completed', 'Epochs completed')

for epoch in range(num_epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        loss = train_step(model, data, target)

        if rank == 0:
            training_loss.set(loss.item())

    if rank == 0:
        epochs_completed.inc()

Performance Profiling #

Using Pyroscope #

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pyroscope-config
  namespace: monitoring
data:
  pyroscope.yaml: |
    server:
      http_listen_port: 4040
    storage:
      type: badger
      badger:
        directory: /data

Profiling Integration #

python
import pyroscope

pyroscope.configure(
    app_name="ml-training",
    server_address="http://pyroscope.monitoring.svc:4040"
)

def train_with_profiling():
    with pyroscope.tag_wrapper({"model": "resnet50"}):
        train_model()

Monitoring Best Practices #

Metric Design #

text
1. Key metrics
   ├── Infrastructure metrics
   │   ├── CPU/memory utilization
   │   ├── GPU utilization
   │   └── Network traffic
   ├── Application metrics
   │   ├── Request latency
   │   ├── Error rate
   │   └── Throughput
   └── Business metrics
       ├── Training progress
       ├── Model accuracy
       └── Pipeline success rate

2. Alert design
   ├── Tiered severities
   ├── Sensible thresholds
   └── Avoid alert fatigue

3. Dashboard design
   ├── Layered views
   ├── Highlight key metrics
   └── Support drill-down analysis

Log Management Practices #

text
1. Log collection
   ├── Structured logs
   ├── Consistent log format
   └── Appropriate log levels

2. Log storage
   ├── Set retention periods
   ├── Compress logs
   └── Separate hot and cold data

3. Log analysis
   ├── Build log indexes
   ├── Configure log-based alerts
   └── Audit logs regularly
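
Structured logging is what makes the `| json` LogQL filter shown earlier work: each record becomes one JSON object per line that Loki can parse into fields. A minimal stdlib formatter sketch (the field names are illustrative choices):

```python
import json
import logging
import sys

# Emit each log record as a single-line JSON object so LogQL's
# "| json" stage can extract fields like level and message.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("trainer")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("epoch finished")  # {"level": "INFO", "logger": "trainer", "message": "epoch finished"}
```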

Troubleshooting #

Common Commands #

bash
# Watch Pod status
kubectl get pods -n kubeflow -w

# Inspect Pod events
kubectl describe pod <pod-name> -n kubeflow

# View logs
kubectl logs <pod-name> -n kubeflow -c <container> --tail=100

# Check resource usage
kubectl top pods -n kubeflow
kubectl top nodes

# List events, newest last
kubectl get events -n kubeflow --sort-by='.lastTimestamp'

Common Issues #

text
1. Pod fails to start
   ├── Check resource quotas
   ├── Check image pulls
   ├── Check volume mounts
   └── Check scheduling constraints

2. Training is stuck
   ├── Check GPU status
   ├── Check network communication
   ├── Check data loading
   └── Check log output

3. Service is unavailable
   ├── Check Ingress configuration
   ├── Check service status
   ├── Check network policies
   └── Check certificate configuration

Next Steps #

With monitoring and logging in place, continue to the end-to-end examples and consolidate what you have learned on real cases!

Last updated: 2026-04-05