Monitoring and Logging #
Overview #
Monitoring and logging are key to keeping a Kubeflow platform running reliably. This chapter covers how to configure and use Kubeflow's monitoring stack.
What to Monitor #
text
Infrastructure monitoring:
├── Node resource usage
├── Pod status
├── Network traffic
└── Storage usage

Application monitoring:
├── Pipeline run status
├── Training job progress
├── Model serving performance
└── Notebook usage

Log management:
├── Centralized log collection
├── Log search and analysis
├── Log-based alerting
└── Log archiving
Prometheus Monitoring #
Installing Prometheus #
bash
# Use the kube-prometheus-stack chart
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace
Configuring a ServiceMonitor #
yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubeflow-monitor
  namespace: monitoring
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: kubeflow
  namespaceSelector:
    matchNames:
      - kubeflow
      - istio-system
  endpoints:
    - port: metrics
      interval: 30s
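The `port: metrics` entry refers to a named Service port, and Prometheus expects that port to serve the plain-text exposition format. As a rough stdlib-only illustration of what such an endpoint returns (the real exporters later in this chapter use `prometheus_client` instead; the metric name here is made up):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def render_metrics(metrics):
    """Render {metric_name: value} pairs in the Prometheus text format,
    one sample per line, sorted for stable output."""
    lines = [f"{name} {value}" for name, value in sorted(metrics.items())]
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics({"kubeflow_demo_up": 1}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To actually serve on the port the ServiceMonitor scrapes, run:
# HTTPServer(("", 8000), MetricsHandler).serve_forever()
```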
Pipeline Metrics #
yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ml-pipeline
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: ml-pipeline
  namespaceSelector:
    matchNames:
      - kubeflow
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
Training Job Metrics #
yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: training-jobs
  namespace: monitoring
spec:
  selector:
    matchLabels:
      kubeflow.org/job-type: training
  namespaceSelector:
    any: true
  podMetricsEndpoints:
    - port: metrics
      interval: 15s
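For this PodMonitor to pick up a training pod, the pod must carry the `kubeflow.org/job-type: training` label and expose a container port named `metrics`. A minimal sketch of such a pod manifest (the image name and default namespace are placeholders):

```python
def training_pod_manifest(name, image, namespace="kubeflow-user-alice"):
    """Build a minimal Pod manifest that the PodMonitor above would select:
    it carries the kubeflow.org/job-type=training label and a container
    port named 'metrics' for Prometheus to scrape."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "labels": {"kubeflow.org/job-type": "training"},
        },
        "spec": {
            "containers": [{
                "name": "trainer",
                "image": image,
                "ports": [{"name": "metrics", "containerPort": 8000}],
            }]
        },
    }

pod = training_pod_manifest("mnist-train-0", "mnist-trainer:latest")
assert pod["metadata"]["labels"]["kubeflow.org/job-type"] == "training"
```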
Grafana Dashboards #
Installing Grafana #
bash
# Grafana is normally installed as part of kube-prometheus-stack;
# to install it separately:
helm repo add grafana https://grafana.github.io/helm-charts
helm install grafana grafana/grafana --namespace monitoring
Kubeflow Dashboard Configuration #
yaml
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: kubeflow-overview
  namespace: monitoring
  labels:
    app: grafana
spec:
  json: |
    {
      "dashboard": {
        "title": "Kubeflow Overview",
        "panels": [
          {
            "title": "Pipeline Runs",
            "type": "graph",
            "targets": [
              {
                "expr": "rate(kubeflow_pipeline_runs_total[5m])",
                "legendFormat": "{{status}}"
              }
            ]
          },
          {
            "title": "Training Jobs",
            "type": "graph",
            "targets": [
              {
                "expr": "kubeflow_training_jobs_active",
                "legendFormat": "{{job_type}}"
              }
            ]
          },
          {
            "title": "GPU Utilization",
            "type": "graph",
            "targets": [
              {
                "expr": "DCGM_FI_DEV_GPU_UTIL",
                "legendFormat": "{{gpu}}"
              }
            ]
          }
        ]
      }
    }
Custom Dashboards #
json
{
  "dashboard": {
    "title": "ML Training Dashboard",
    "panels": [
      {
        "title": "Training Progress",
        "type": "graph",
        "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
        "targets": [
          {
            "expr": "training_loss",
            "legendFormat": "Loss"
          },
          {
            "expr": "training_accuracy",
            "legendFormat": "Accuracy"
          }
        ]
      },
      {
        "title": "GPU Memory",
        "type": "gauge",
        "gridPos": {"x": 12, "y": 0, "w": 12, "h": 8},
        "targets": [
          {
            "expr": "DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL * 100",
            "legendFormat": "Memory Usage %"
          }
        ]
      }
    ]
  }
}
Log Management #
Installing Loki #
bash
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack --namespace monitoring
Configuring Promtail #
yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: monitoring
data:
  promtail.yaml: |
    server:
      http_listen_port: 9080
    positions:
      filename: /tmp/positions.yaml
    clients:
      - url: http://loki:3100/loki/api/v1/push
    scrape_configs:
      - job_name: kubeflow-pods
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - kubeflow
                - istio-system
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            target_label: app
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
Querying Logs #
bash
# Query logs with LogCLI
logcli query '{namespace="kubeflow"}' --limit=100
# Query logs for a specific application
logcli query '{app="ml-pipeline"} |= "error"'
# Query training logs
logcli query '{namespace="kubeflow-user-alice"} |~ "training"'
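LogCLI is a thin client over Loki's HTTP API, so the same queries can be issued programmatically against the `query_range` endpoint. A sketch, assuming the in-cluster `http://loki:3100` address used in the Promtail config above:

```python
import json
import urllib.parse
import urllib.request

LOKI_URL = "http://loki:3100"  # in-cluster service address, as in the Promtail config

def build_query_url(logql, limit=100):
    """Build a Loki query_range request URL for a LogQL expression."""
    params = urllib.parse.urlencode({"query": logql, "limit": limit})
    return f"{LOKI_URL}/loki/api/v1/query_range?{params}"

def query_loki(logql, limit=100):
    """Run a LogQL query and return the result streams."""
    with urllib.request.urlopen(build_query_url(logql, limit)) as resp:
        return json.load(resp)["data"]["result"]

# Equivalent to: logcli query '{app="ml-pipeline"} |= "error"'
url = build_query_url('{app="ml-pipeline"} |= "error"')
```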
Viewing Logs in Grafana #
text
1. Configure the Loki data source
   ├── URL: http://loki:3100
   └── Access mode: Server
2. Create a log panel
   ├── Open Explore
   ├── Select the Loki data source
   └── Enter a query
3. Common queries
   ├── {namespace="kubeflow"} |= "error"
   ├── {app="notebook"} | json | line_format "{{.message}}"
   └── {pod=~"tensorflow-job-.*"} |~ "epoch"
Alerting #
Prometheus Alert Rules #
yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubeflow-alerts
  namespace: monitoring
spec:
  groups:
    - name: kubeflow.rules
      rules:
        - alert: PipelineRunFailed
          expr: kubeflow_pipeline_run_status{status="failed"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pipeline run failed"
            description: "Pipeline {{ $labels.pipeline }} failed"
        - alert: TrainingJobStuck
          expr: kubeflow_training_job_duration_seconds > 86400
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Training job running too long"
            description: "Training job {{ $labels.job_name }} has been running for over 24 hours"
        - alert: GPUUtilizationLow
          expr: DCGM_FI_DEV_GPU_UTIL < 10
          for: 30m
          labels:
            severity: info
          annotations:
            summary: "GPU utilization is low"
            description: "GPU {{ $labels.gpu }} utilization is below 10%"
        - alert: NotebookPodHighMemory
          expr: container_memory_usage_bytes{pod=~"jupyter-.*"} / container_spec_memory_limit_bytes{pod=~"jupyter-.*"} > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Notebook memory usage high"
            description: "Notebook {{ $labels.pod }} is using over 90% of its memory limit"
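When tuning a threshold, it can help to restate the rule's expression in plain code and try a few values. The `NotebookPodHighMemory` condition, for instance, reduces to this (an illustration of the logic only, not how PromQL evaluates it):

```python
def notebook_memory_alert(usage_bytes, limit_bytes, threshold=0.9):
    """Mirror the NotebookPodHighMemory expression: fire when container
    memory usage exceeds 90% of the configured memory limit."""
    if limit_bytes <= 0:
        return False  # no limit set: the PromQL ratio would be undefined
    return usage_bytes / limit_bytes > threshold

# A 1 GiB notebook at ~93% usage fires; at ~78% it does not.
assert notebook_memory_alert(950 * 1024**2, 1024**3) is True
assert notebook_memory_alert(800 * 1024**2, 1024**3) is False
```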
Alertmanager Configuration #
yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alerts@example.com'
      smtp_auth_username: 'alerts@example.com'
      smtp_auth_password: 'password'
    route:
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'default'
      routes:
        - match:
            severity: critical
          receiver: 'critical'
        - match:
            severity: warning
          receiver: 'warning'
    receivers:
      - name: 'default'
        email_configs:
          - to: 'team@example.com'
      - name: 'critical'
        email_configs:
          - to: 'oncall@example.com'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/xxx'
            channel: '#alerts-critical'
      - name: 'warning'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/xxx'
            channel: '#alerts-warning'
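The route tree above is matched top-down: the first child route whose `match` labels all apply wins, and anything unmatched falls through to the root `default` receiver. Expressed as plain Python for illustration:

```python
def route_alert(labels):
    """Mirror the Alertmanager route tree above: first matching child
    route wins; unmatched alerts go to the root receiver."""
    routes = [
        ({"severity": "critical"}, "critical"),
        ({"severity": "warning"}, "warning"),
    ]
    for match, receiver in routes:
        if all(labels.get(k) == v for k, v in match.items()):
            return receiver
    return "default"

assert route_alert({"severity": "critical"}) == "critical"
assert route_alert({"severity": "warning"}) == "warning"
assert route_alert({"severity": "info"}) == "default"
```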
Training Monitoring #
Custom Training Metrics #
python
import tensorflow as tf
from prometheus_client import Counter, Gauge, Summary, start_http_server

# Define the metrics to export
training_loss = Gauge('training_loss', 'Current training loss')
training_accuracy = Gauge('training_accuracy', 'Current training accuracy')
training_steps = Counter('training_steps_total', 'Total training steps')
training_duration = Summary('training_duration_seconds', 'Time spent training')

# Expose metrics on port 8000 for Prometheus to scrape
start_http_server(8000)

class MetricsCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        training_loss.set(logs['loss'])
        training_accuracy.set(logs['accuracy'])
        training_steps.inc()

@training_duration.time()
def train_model():
    # model, x_train and y_train are defined elsewhere in the training script
    model.fit(x_train, y_train, callbacks=[MetricsCallback()])
PyTorch Training Metrics #
python
import os
import torch
from prometheus_client import Counter, Gauge, start_http_server

# In distributed training, export metrics from rank 0 only
rank = int(os.environ.get('RANK', 0))
if rank == 0:
    start_http_server(8000)

training_loss = Gauge('training_loss', 'Training loss')
training_accuracy = Gauge('training_accuracy', 'Training accuracy')
epochs_completed = Counter('epochs_completed', 'Epochs completed')

# model, train_loader, train_step and num_epochs are defined elsewhere
for epoch in range(num_epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        loss = train_step(model, data, target)
        if rank == 0:
            training_loss.set(loss.item())
    if rank == 0:
        epochs_completed.inc()
Profiling #
Using Pyroscope #
yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pyroscope-config
  namespace: monitoring
data:
  pyroscope.yaml: |
    server:
      http_listen_port: 4040
    storage:
      type: badger
      badger:
        directory: /data
Profiling Integration #
python
import pyroscope

pyroscope.configure(
    app_name="ml-training",
    server_address="http://pyroscope.monitoring.svc:4040"
)

def train_with_profiling():
    # Tag the profile so it can be filtered by model in the Pyroscope UI
    with pyroscope.tag_wrapper({"model": "resnet50"}):
        train_model()
Monitoring Best Practices #
Metric Design #
text
1. Key metrics
   ├── Infrastructure metrics
   │   ├── CPU/memory utilization
   │   ├── GPU utilization
   │   └── Network traffic
   ├── Application metrics
   │   ├── Request latency
   │   ├── Error rate
   │   └── Throughput
   └── Business metrics
       ├── Training progress
       ├── Model accuracy
       └── Pipeline success rate
2. Alert design
   ├── Tiered severities
   ├── Sensible thresholds
   └── Avoid alert fatigue
3. Dashboard design
   ├── Layered views
   ├── Highlight key metrics
   └── Support drill-down analysis
Log Management #
text
1. Log collection
   ├── Structured logs
   ├── Consistent log format
   └── Appropriate log levels
2. Log storage
   ├── Set retention periods
   ├── Compress logs
   └── Separate hot and cold data
3. Log analysis
   ├── Build log indexes
   ├── Configure log-based alerts
   └── Audit logs regularly
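Structured (JSON) logs are what makes the `| json` LogQL pipeline stage work out of the box. A minimal stdlib-only sketch of a JSON formatter for Python training code:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so that Loki's `| json`
    pipeline stage can extract fields without extra parsing."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("trainer")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("epoch finished")  # emits a single JSON line
```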
Troubleshooting #
Common Commands #
bash
# Watch Pod status
kubectl get pods -n kubeflow -w
# Inspect Pod events
kubectl describe pod <pod-name> -n kubeflow
# View logs
kubectl logs <pod-name> -n kubeflow -c <container> --tail=100
# Check resource usage
kubectl top pods -n kubeflow
kubectl top nodes
# List events
kubectl get events -n kubeflow --sort-by='.lastTimestamp'
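When the interactive commands above are not enough, `kubectl get pods -o json` output can be post-processed in a script. A small sketch that flags pods stuck outside the Running/Succeeded phases (the sample data is made up):

```python
import json

def failing_pods(pod_list):
    """Given parsed `kubectl get pods -o json` output, return
    (name, phase) for every pod not in Running or Succeeded."""
    bad = []
    for pod in pod_list.get("items", []):
        phase = pod["status"].get("phase", "Unknown")
        if phase not in ("Running", "Succeeded"):
            bad.append((pod["metadata"]["name"], phase))
    return bad

# Example input, e.g. saved with:
#   kubectl get pods -n kubeflow -o json > pods.json
sample = {"items": [
    {"metadata": {"name": "ml-pipeline-0"}, "status": {"phase": "Running"}},
    {"metadata": {"name": "jupyter-alice"}, "status": {"phase": "Pending"}},
]}
print(failing_pods(sample))  # [('jupyter-alice', 'Pending')]
```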
Common Issues #
text
1. Pod fails to start
   ├── Check resource quotas
   ├── Check image pulls
   ├── Check volume mounts
   └── Check scheduling constraints
2. Training hangs
   ├── Check GPU status
   ├── Check network communication
   ├── Check data loading
   └── Check log output
3. Service unavailable
   ├── Check Ingress configuration
   ├── Check service status
   ├── Check network policies
   └── Check certificate configuration
Next Steps #
You now know how to configure monitoring and logging. Next, work through the end-to-end examples to put it all into practice!
Last updated: 2026-04-05