Troubleshooting

I. Troubleshooting Method

1.1 Troubleshooting Workflow

text

Troubleshooting workflow:

┌─────────────────────────────────────────────┐
│ 1. Confirm the problem                      │
├─────────────────────────────────────────────┤
│ • Problem description                       │
│ • Scope of impact                           │
│ • Time of occurrence                        │
└─────────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────┐
│ 2. Collect information                      │
├─────────────────────────────────────────────┤
│ • Logs                                      │
│ • Metrics                                   │
│ • Configuration                             │
└─────────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────┐
│ 3. Analyze the cause                        │
├─────────────────────────────────────────────┤
│ • Check configuration                       │
│ • Check resources                           │
│ • Check network                             │
└─────────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────┐
│ 4. Resolve the problem                      │
├─────────────────────────────────────────────┤
│ • Fix configuration                         │
│ • Adjust resources                          │
│ • Restart the service                       │
└─────────────────────────────────────────────┘

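1.2 Common Diagnostic Commands

bash

# Check service status
systemctl status prometheus

# Follow the logs
journalctl -u prometheus -f

# Check the listening port (use `ss -tlnp` where netstat is not installed)
netstat -tlnp | grep 9090

# Check the process
ps aux | grep prometheus

# Validate the configuration
promtool check config prometheus.yml

# Validate the rules
promtool check rules alerting_rules.yml

# Health check
curl http://localhost:9090/-/healthy

# Readiness check
curl http://localhost:9090/-/ready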
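
The health and readiness probes can be wrapped in a small function for repeated use. The following is a minimal sketch, assuming Prometheus listens on `http://localhost:9090`; the function name `check_prometheus` is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: probe the /-/healthy and /-/ready endpoints.
# Prints one line per endpoint and returns non-zero if any probe fails.
check_prometheus() {
  local base_url="${1:-http://localhost:9090}"
  local failed=0
  local path code
  for path in /-/healthy /-/ready; do
    # -s: silent, -o /dev/null: drop the body, -w: print only the status code
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "${base_url}${path}")
    if [ "$code" = "200" ]; then
      echo "OK   ${path}"
    else
      echo "FAIL ${path} (HTTP ${code:-none})"
      failed=1
    fi
  done
  return "$failed"
}
```

Calling `check_prometheus` with no argument probes the local instance; pass another base URL to probe a remote one.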

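II. Common Problems

2.1 Service Fails to Start

text

Problem: Prometheus fails to start

Possible causes:
┌─────────────────────────────────────────────┐
│ 1. Configuration file errors                │
├─────────────────────────────────────────────┤
│ • YAML syntax errors                        │
│ • Invalid configuration options             │
│ • Wrong file paths                          │
├─────────────────────────────────────────────┤
│ 2. Port already in use                      │
├─────────────────────────────────────────────┤
│ • Port taken by another service             │
│ • Multiple instances conflicting            │
├─────────────────────────────────────────────┤
│ 3. Permission problems                      │
├─────────────────────────────────────────────┤
│ • Data directory permissions                │
│ • Configuration file permissions            │
└─────────────────────────────────────────────┘

Solutions:

# Validate the configuration
promtool check config prometheus.yml

# Check whether the port is already in use
netstat -tlnp | grep 9090

# Check and fix ownership of the data directory
ls -la /var/lib/prometheus
chown -R prometheus:prometheus /var/lib/prometheus

# Run in the foreground with debug logging to see detailed errors
prometheus --config.file=prometheus.yml --log.level=debug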
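
The file and permission checks can be combined into one pre-start check. This is only a sketch: the function name `preflight` is hypothetical, and it tests access for whichever user runs it, so it is most meaningful when run as the service user.

```shell
#!/usr/bin/env bash
# Hypothetical pre-start check: verifies that the config file is readable and
# that the data directory exists and is writable by the current user.
preflight() {
  local config_file="$1"
  local data_dir="$2"
  local status=0
  if [ ! -r "$config_file" ]; then
    echo "config file not readable: $config_file"
    status=1
  fi
  if [ ! -d "$data_dir" ] || [ ! -w "$data_dir" ]; then
    echo "data directory missing or not writable: $data_dir"
    status=1
  fi
  if [ "$status" -eq 0 ]; then
    echo "preflight checks passed"
  fi
  return "$status"
}
```

For example, `preflight prometheus.yml /var/lib/prometheus` reports which of the two checks fails, if any. It does not replace `promtool check config`, which validates the file contents.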

2.2 Scrape Failures

text

Problem: target scrapes fail

Possible causes:
┌─────────────────────────────────────────────┐
│ 1. Target unreachable                       │
├─────────────────────────────────────────────┤
│ • No network connectivity                   │
│ • Target service not running                │
│ • Blocked by a firewall                     │
├─────────────────────────────────────────────┤
│ 2. Timeouts                                 │
├─────────────────────────────────────────────┤
│ • Target responds slowly                    │
│ • Network latency                           │
├─────────────────────────────────────────────┤
│ 3. Authentication failures                  │
├─────────────────────────────────────────────┤
│ • Wrong credentials                         │
│ • TLS certificate problems                  │
└─────────────────────────────────────────────┘

Solutions:

# Check target status
curl http://localhost:9090/api/v1/targets

# Test a scrape manually
curl http://target:9100/metrics

# Check network connectivity
ping target
telnet target 9100

# Search the logs
journalctl -u prometheus | grep -i "scrape"
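
To tell a slow target apart from an unreachable one, it can help to time the manual scrape. The sketch below is an illustration only: `probe_target` is a made-up name, and the 10-second default mirrors the Prometheus default `scrape_timeout`, which your configuration may override.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fetch a metrics endpoint and report the HTTP status
# and total response time, using the scrape timeout as the time limit.
probe_target() {
  local url="$1"
  local timeout="${2:-10}"
  local result code elapsed
  # curl prints "<http_code> <seconds>", for example "200 0.123"
  result=$(curl -s -o /dev/null -w '%{http_code} %{time_total}' --max-time "$timeout" "$url")
  code=${result%% *}
  elapsed=${result##* }
  if [ "$code" != "200" ]; then
    echo "FAIL $url: HTTP $code after ${elapsed}s"
    return 1
  fi
  echo "OK $url: ${elapsed}s (timeout ${timeout}s)"
}
```

A call such as `probe_target http://target:9100/metrics` reports status `000` when the connection fails or the time limit is hit; a response time close to the timeout suggests raising `scrape_timeout` or reducing the work the target does per scrape.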

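2.3 Slow Queries

text

Problem: queries respond slowly

Possible causes:
┌─────────────────────────────────────────────┐
│ 1. Expensive queries                        │
├─────────────────────────────────────────────┤
│ • Complex PromQL                            │
│ • Wide time ranges                          │
│ • High-cardinality labels                   │
├─────────────────────────────────────────────┤
│ 2. Insufficient resources                   │
├─────────────────────────────────────────────┤
│ • Not enough CPU                            │
│ • Not enough memory                         │
│ • Slow disk I/O                             │
├─────────────────────────────────────────────┤
│ 3. Large data volume                        │
├─────────────────────────────────────────────┤
│ • Too many time series                      │
│ • Large number of samples                   │
└─────────────────────────────────────────────┘

Solutions:

# Check average query latency (seconds per request)
rate(prometheus_http_request_duration_seconds_sum{handler="/api/v1/query"}[5m])
  / rate(prometheus_http_request_duration_seconds_count{handler="/api/v1/query"}[5m])

# Check the number of time series
prometheus_tsdb_head_series

# Check memory usage
process_resident_memory_bytes

# Optimize queries:
# - Use recording rules
# - Narrow the query range
# - Filter out metrics you do not need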
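
The recording-rules advice can be made concrete with a small rule file. The sketch below writes one; the metric `http_requests_total`, the rule name, and the group name are placeholders chosen for illustration, not taken from any particular setup.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write an example recording-rule file to the given path.
# http_requests_total and the rule name are placeholder assumptions.
write_recording_rule() {
  local rule_file="$1"
  cat > "$rule_file" <<'EOF'
groups:
  - name: example_recording_rules
    interval: 1m
    rules:
      # Precompute the per-job request rate so dashboards read one cheap series
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
EOF
}
```

After `write_recording_rule recording_rules.yml`, the file can be validated with `promtool check rules recording_rules.yml` and then listed under `rule_files` in `prometheus.yml`. Dashboards would then query `job:http_requests:rate5m` instead of re-evaluating the full expression.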

2.4 Insufficient Disk Space

text

Problem: disk space is running out

Possible causes:
┌─────────────────────────────────────────────┐
│ 1. Retention period too long                │
├─────────────────────────────────────────────┤
│ • Retention set too high                    │
│ • No size limit configured                  │
├─────────────────────────────────────────────┤
│ 2. Too many time series                     │
├─────────────────────────────────────────────┤
│ • High-cardinality labels                   │
│ • Too many scrape targets                   │
├─────────────────────────────────────────────┤
│ 3. Compaction lag                           │
├─────────────────────────────────────────────┤
│ • Slow compaction                           │
│ • Old data not cleaned up                   │
└─────────────────────────────────────────────┘

Solutions:

# Check disk usage
df -h /var/lib/prometheus

# Check data size
du -sh /var/lib/prometheus/*

# Adjust the retention time (startup flag; the default is 15d)
prometheus --storage.tsdb.retention.time=15d

# Or set a size limit (startup flag)
prometheus --storage.tsdb.retention.size=50GB

# Delete data (requires the admin API, enabled with --web.enable-admin-api).
# delete_series only marks the data as deleted; clean_tombstones frees the space.
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={job="old-job"}'
curl -X POST 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones'
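
Before picking a retention value, a rough capacity estimate helps. The Prometheus storage documentation gives the rule of thumb: needed disk space = retention time in seconds × ingested samples per second × bytes per sample. The sketch below applies it; the function name is hypothetical and the default of 2 bytes per sample is an assumption at the upper end of the commonly quoted 1 to 2 bytes.

```shell
#!/usr/bin/env bash
# Rough capacity estimate in GiB:
#   bytes = retention_seconds * samples_per_second * bytes_per_sample
# Arguments: retention in days, ingested samples per second, and optionally
# bytes per sample (assumed 2 when omitted).
estimate_tsdb_gib() {
  local retention_days="$1"
  local samples_per_second="$2"
  local bytes_per_sample="${3:-2}"
  awk -v d="$retention_days" -v s="$samples_per_second" -v b="$bytes_per_sample" \
    'BEGIN { printf "%.1f\n", d * 86400 * s * b / (1024 * 1024 * 1024) }'
}
```

For example, `estimate_tsdb_gib 15 10000` prints `24.1`. The ingestion rate can be read from `rate(prometheus_tsdb_head_samples_appended_total[5m])`.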

2.5 Alerts Not Firing

text

Problem: alerts do not fire

Possible causes:
┌─────────────────────────────────────────────┐
│ 1. Rule configuration errors                │
├─────────────────────────────────────────────┤
│ • Incorrect PromQL expression               │
│ • Unsuitable "for" duration                 │
│ • Rule file not loaded                      │
├─────────────────────────────────────────────┤
│ 2. Alertmanager configuration errors        │
├─────────────────────────────────────────────┤
│ • Wrong Alertmanager address                │
│ • Routing misconfigured                     │
│ • Receiver misconfigured                    │
├─────────────────────────────────────────────┤
│ 3. Notification channel problems            │
├─────────────────────────────────────────────┤
│ • Mail server problems                      │
│ • Wrong Slack webhook                       │
│ • Network problems                          │
└─────────────────────────────────────────────┘

Solutions:

# Check the loaded alerting rules
curl http://localhost:9090/api/v1/rules

# Check active alerts
curl http://localhost:9090/api/v1/alerts

# Check the Alertmanager status (API v2; the v1 API was removed in recent releases)
curl http://localhost:9093/api/v2/status

# Check the alerts Alertmanager has received
curl http://localhost:9093/api/v2/alerts

# Search the logs
journalctl -u prometheus | grep -i "alert"
journalctl -u alertmanager | grep -i "notify"
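
To separate rule problems from notification problems, a hand-made alert can be posted straight to Alertmanager: if the notification arrives, routing and receivers work and the fault lies on the Prometheus side. The following is a sketch under assumptions: Alertmanager is taken to listen on `http://localhost:9093`, and the function names, label values, and annotation text are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a minimal alert payload for the Alertmanager
# v2 API. Label and annotation values are placeholders.
build_test_alert() {
  local name="${1:-TestAlert}"
  printf '[{"labels":{"alertname":"%s","severity":"info"},"annotations":{"summary":"Manual test alert"}}]\n' "$name"
}

# Post the payload to Alertmanager (base URL assumed, override as needed).
send_test_alert() {
  local base_url="${1:-http://localhost:9093}"
  build_test_alert "${2:-TestAlert}" |
    curl -s -X POST -H 'Content-Type: application/json' --data-binary @- "${base_url}/api/v2/alerts"
}
```

Running `send_test_alert` and then querying `/api/v2/alerts` should show the test alert; whether a notification is actually sent still depends on the routes matching its labels.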

III. Diagnostic Tools

3.1 promtool

bash

# Validate the configuration
promtool check config prometheus.yml

# Validate the rules
promtool check rules alerting_rules.yml

# Analyze the TSDB (churn, label cardinality, compaction efficiency)
promtool tsdb analyze /var/lib/prometheus

# Query data
promtool query instant http://localhost:9090 up
promtool query range http://localhost:9090 up --start=2024-01-01T00:00:00Z --end=2024-01-01T01:00:00Z

# List the values of a label
promtool query labels http://localhost:9090 job

3.2 Built-in Diagnostic Endpoints

bash

# Health check
curl http://localhost:9090/-/healthy

# Readiness check
curl http://localhost:9090/-/ready

# Runtime information
curl http://localhost:9090/api/v1/status/runtimeinfo

# Loaded configuration
curl http://localhost:9090/api/v1/status/config

# Build information
curl http://localhost:9090/api/v1/status/buildinfo

# Target status
curl http://localhost:9090/api/v1/targets

# Metric metadata reported by targets
curl http://localhost:9090/api/v1/targets/metadata

3.3 Performance Profiling

bash

# pprof index page:
# http://localhost:9090/debug/pprof/

# CPU profile (30 seconds); the URL is quoted so the shell does not interpret "?"
curl -s 'http://localhost:9090/debug/pprof/profile?seconds=30' -o cpu.prof
go tool pprof cpu.prof

# Heap profile
curl -s http://localhost:9090/debug/pprof/heap -o heap.prof
go tool pprof heap.prof

# Goroutine profile
curl -s http://localhost:9090/debug/pprof/goroutine -o goroutine.prof
go tool pprof goroutine.prof

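IV. Log Analysis

4.1 Log Configuration

bash

# Set the log level (startup flag: debug, info, warn, or error)
prometheus --log.level=debug

# Set the log format (startup flag: logfmt or json)
prometheus --log.format=json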

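4.2 Common Log Errors

text

Common log errors (exact wording varies between versions):

1. "error loading config"
   → Configuration file syntax error

2. "error refreshing service discovery"
   → Service discovery misconfigured

3. "error scraping target"
   → Target scrape failed

4. "error sending alert"
   → Connection to Alertmanager failed

5. "out of memory"
   → Not enough memory

6. "disk full"
   → Not enough disk space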
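
The message-to-cause mapping above can be applied to a log stream mechanically. This is a minimal sketch, assuming the messages appear with the wording listed here (matched case-insensitively); real log lines may be phrased differently, and the function name and category tags are invented for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read log lines on stdin and prefix each line that
# contains one of the listed messages with its likely cause category.
# Lines that match none of the patterns are skipped.
classify_log_errors() {
  local line lower
  while IFS= read -r line; do
    lower=$(printf '%s' "$line" | tr '[:upper:]' '[:lower:]')
    case "$lower" in
      *"error loading config"*)               echo "config: $line" ;;
      *"error refreshing service discovery"*) echo "service-discovery: $line" ;;
      *"error scraping target"*)              echo "scrape: $line" ;;
      *"error sending alert"*)                echo "alertmanager: $line" ;;
      *"out of memory"*)                      echo "memory: $line" ;;
      *"disk full"*)                          echo "disk: $line" ;;
    esac
  done
}
```

A typical invocation would be `journalctl -u prometheus --no-pager | classify_log_errors`.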

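V. Summary

Troubleshooting workflow:

Step	Description
Confirm the problem	Description, scope, time
Collect information	Logs, metrics, configuration
Analyze the cause	Configuration, resources, network
Resolve the problem	Fix, adjust, restart

Common commands:

Command	Description
promtool check	Validate configuration and rules
curl /-/healthy	Health check
curl /api/v1/targets	Target status

Congratulations on completing your Prometheus learning journey!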