Performance Optimization

1. Performance bottleneck analysis

1.1 Common performance problems

text

Common performance problems:

┌─────────────────────────────────────────────┐
│ 1. Slow queries                             │
├─────────────────────────────────────────────┤
│ • Complex PromQL queries                    │
│ • Queries over long time ranges             │
│ • High-cardinality labels                   │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ 2. Slow scrapes                             │
├─────────────────────────────────────────────┤
│ • Too many targets                          │
│ • Too many metrics                          │
│ • Network latency                           │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ 3. Slow storage                             │
├─────────────────────────────────────────────┤
│ • Disk I/O bottlenecks                      │
│ • Insufficient memory                       │
│ • Compaction delays                         │
└─────────────────────────────────────────────┘

1.2 Metrics for monitoring performance

promql

# Query latency (divide the rate of _sum by the rate of _count to get the average)
prometheus_http_request_duration_seconds_sum{handler="/api/v1/query"}
prometheus_http_request_duration_seconds_count{handler="/api/v1/query"}

# Scrape latency
scrape_duration_seconds

# Rule evaluation latency
prometheus_rule_evaluation_duration_seconds

# Memory usage
process_resident_memory_bytes

# Disk usage
prometheus_tsdb_storage_blocks_bytes

# Number of time series in the head block
prometheus_tsdb_head_series

# Number of scrape targets
count(up)
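
The `_sum` and `_count` series are only useful once they are turned into a rate. As a minimal sketch, the rule group below derives an average query latency and the slowest scrape per job from the metrics listed above. The file name, group name, and record names are assumptions chosen for illustration.

```yaml
# self_monitoring_rules.yml (hypothetical file name)
groups:
  - name: prometheus_self_monitoring   # hypothetical group name
    rules:
      # Average latency of instant queries over the last 5 minutes
      - record: instance:prometheus_query_duration_seconds:avg5m
        expr: |
          rate(prometheus_http_request_duration_seconds_sum{handler="/api/v1/query"}[5m])
          /
          rate(prometheus_http_request_duration_seconds_count{handler="/api/v1/query"}[5m])

      # Slowest scrape per job
      - record: job:scrape_duration_seconds:max
        expr: max by (job) (scrape_duration_seconds)
```

When no queries arrive during the window, the first expression evaluates to NaN, which is expected for a ratio of two zero rates.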

2. Query optimization

2.1 Reduce the number of time series

promql

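# Not recommended: selects every series of the metric
http_requests_total

# Recommended: filter with label matchers
http_requests_total{job="api-server"}

# Recommended: query a series pre-aggregated by a recording rule
job:http_requests:rate5m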

2.2 Narrow the query range

promql

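# Not recommended: a 30-day range selector loads every raw sample in the window
http_requests_total[30d]

# Recommended: a reasonable range
http_requests_total[5m]

# Not recommended: returns every series unaggregated
http_requests_total

# Recommended: aggregate
sum by (method) (http_requests_total)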

2.3 Use recording rules

yaml

# recording_rules.yml

groups:
  - name: precomputed_rules
    interval: 30s
    rules:
      # Precompute the request rate
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Precompute the rate of 5xx responses (requests per second)
      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

      # Precompute availability (share of non-5xx requests)
      - record: job:availability:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

      # Precompute P99 latency
      - record: job:http_request_duration:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
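
A rule file has no effect until the main configuration references it. The fragment below is a minimal sketch of that wiring, assuming `recording_rules.yml` sits in the same directory as `prometheus.yml`. The 30s `evaluation_interval` is an assumed value that only applies to groups without their own `interval`.

```yaml
# prometheus.yml
global:
  evaluation_interval: 30s   # assumed value; used by rule groups that set no interval

rule_files:
  - "recording_rules.yml"    # assumed to sit next to prometheus.yml
```

Running `promtool check rules recording_rules.yml` before reloading Prometheus catches syntax errors in the rule file early.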

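2.4 Avoid high-cardinality labels

promql

# Not recommended: a high-cardinality label creates one series per user
http_requests_total{user_id="12345"}

# Recommended: leave high-cardinality values out of labels
# and record them in logs instead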
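
As a safety net against accidental cardinality growth, a scrape job can cap the number of samples it accepts. This is a sketch only: the job name, target, and the budget of 50000 are assumptions to adapt. When a scrape returns more samples than `sample_limit` after metric relabeling, Prometheus treats the whole scrape as failed.

```yaml
scrape_configs:
  - job_name: 'api-server'        # hypothetical job
    sample_limit: 50000           # assumed budget; exceeding it fails the scrape
    static_configs:
      - targets: ['api:8080']     # hypothetical target
```

A failed scrape shows up as `up == 0` for the target, so the limit turns a silent cardinality explosion into a visible signal.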

3. Scrape optimization

3.1 Adjust the scrape interval

yaml

# prometheus.yml

global:
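  scrape_interval: 15s    # Global default scrape interval

scrape_configs:
  # Critical services: high-frequency scraping
  - job_name: 'critical-service'
    scrape_interval: 10s
    static_configs:
      - targets: ['critical:8080']

  # Ordinary services: normal scraping
  - job_name: 'normal-service'
    scrape_interval: 30s
    static_configs:
      - targets: ['normal:8080']

  # Historical data: low-frequency scraping
  # (with the default 5m query lookback, an interval this long
  # can leave gaps in instant query results)
  - job_name: 'historical-data'
    scrape_interval: 5m
    static_configs:
      - targets: ['historical:8080']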

3.2 Filter out unneeded metrics

yaml

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
    
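    # Drop metrics that are not needed
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_.*'
        action: drop

      - source_labels: [__name__]
        regex: 'node_arp_.*'
        action: drop

      # Keep only the metrics that are needed. A keep rule drops every
      # series that does not match, so it makes the drop rules above
      # redundant; in practice use either the drops or the keep.
      - source_labels: [__name__]
        regex: 'node_(cpu|memory|disk|network).*'
        action: keep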
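
`metric_relabel_configs` runs after the scrape, so the exporter still generates and transfers every metric. For node_exporter specifically, the collectors can instead be restricted at request time with the `collect[]` URL parameter. The sketch below assumes the four collectors shown are the ones needed.

```yaml
scrape_configs:
  - job_name: 'node-exporter'
    params:
      collect[]:          # only these collectors run for this scrape
        - cpu
        - meminfo
        - diskstats
        - netdev
    static_configs:
      - targets: ['localhost:9100']
```

This reduces work on the exporter and on the network as well as in Prometheus, while relabeling only reduces what gets stored.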

3.3 Shard the scraping

yaml

# Prometheus instance 1 (its own prometheus.yml): scrapes the first half of the targets
scrape_configs:
  - job_name: 'node-exporter-shard1'
    static_configs:
      - targets:
          - 'node1:9100'
          - 'node2:9100'
          - 'node3:9100'

# Prometheus instance 2 (a separate prometheus.yml): scrapes the second half of the targets
scrape_configs:
  - job_name: 'node-exporter-shard2'
    static_configs:
      - targets:
          - 'node4:9100'
          - 'node5:9100'
          - 'node6:9100'
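
Splitting the target list by hand becomes hard to maintain as targets come and go. A common alternative is to give every instance the same target list and let each one keep a share of it with the `hashmod` relabel action. The following is a sketch for two shards; the temporary label name `__tmp_hash` is an arbitrary choice.

```yaml
# prometheus.yml on shard 0 of 2; shard 1 is identical except for regex: '1'
scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'node1:9100'
          - 'node2:9100'
          - 'node3:9100'
          - 'node4:9100'
          - 'node5:9100'
          - 'node6:9100'
    relabel_configs:
      # Hash each target address into the range 0..1
      - source_labels: [__address__]
        modulus: 2
        target_label: __tmp_hash
        action: hashmod
      # Keep only the targets that belong to this shard
      - source_labels: [__tmp_hash]
        regex: '0'
        action: keep
```

The hash does not guarantee an exact 50/50 split, but it stays stable when targets are added or removed, and it works the same way with service discovery.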

4. Storage optimization

4.1 Adjust the data retention period

bash

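# Startup flag: keep data for 30 days (the default is 15d)
prometheus \
    --storage.tsdb.retention.time=30d

# Or cap retention by size; if both are set, whichever limit is reached first applies
prometheus \
    --storage.tsdb.retention.size=10GB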

4.2 Adjust compaction parameters

bash

# Adjust block durations (advanced flags; the defaults suit most deployments)
prometheus \
    --storage.tsdb.max-block-duration=3h \
    --storage.tsdb.min-block-duration=2h

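4.3 Use SSD storage

text

Storage recommendations:

┌─────────────────────────────────────────────┐
│ Recommended configuration                   │
├─────────────────────────────────────────────┤
│ • Use SSD storage                           │
│ • Avoid network file systems such as NFS    │
│ • Reserve enough free space                 │
│ • Monitor disk usage regularly              │
└─────────────────────────────────────────────┘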
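
Regular disk monitoring can be automated with an alerting rule. The sketch below assumes node_exporter is scraped on the Prometheus host and that the TSDB lives on a filesystem mounted at `/var/lib/prometheus`. The mount point, alert name, and 20% threshold are all assumptions to adapt.

```yaml
# disk_alerts.yml (hypothetical file name)
groups:
  - name: prometheus_disk
    rules:
      - alert: PrometheusDiskSpaceLow   # hypothetical alert name
        # Fires when less than 20% of the data filesystem is free
        expr: |
          node_filesystem_avail_bytes{mountpoint="/var/lib/prometheus"}
            / node_filesystem_size_bytes{mountpoint="/var/lib/prometheus"}
          < 0.2
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Less than 20% free space on the Prometheus data volume"
```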

5. Memory optimization

5.1 Monitor memory usage

promql

# Resident memory of the process
process_resident_memory_bytes

# Go heap memory in use
go_memstats_heap_inuse_bytes

# Memory obtained from the OS by the Go runtime
go_memstats_sys_bytes
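
Memory use tends to grow with the number of active series, so the ratio of the two is a handy trend to track. The rule below is a rough sketch, assuming Prometheus scrapes itself under `job="prometheus"`. The group and record names are made up for illustration.

```yaml
groups:
  - name: prometheus_memory   # hypothetical group name
    rules:
      # Approximate resident memory per active series
      - record: instance:prometheus_memory_bytes_per_series:ratio
        expr: |
          process_resident_memory_bytes{job="prometheus"}
          /
          prometheus_tsdb_head_series{job="prometheus"}
```

A sudden rise in this ratio without a matching rise in series count points at query load or scrape churn rather than plain growth in series.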

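5.2 Adjust memory-related parameters

bash

# Maximum number of concurrently executing queries (20 is the default)
prometheus --query.max-concurrency=20

# Query timeout (2m is the default)
prometheus --query.timeout=2m

# Maximum number of samples a single query may load (50000000 is the default);
# lowering these values limits the memory that queries can consume
prometheus --query.max-samples=50000000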

6. Network optimization

6.1 Adjust connection parameters

bash

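# Maximum number of simultaneous connections (512 is the default)
prometheus --web.max-connections=512

# The scrape timeout is not a command-line flag: set scrape_timeout
# (default 10s) in prometheus.yml, either globally or per scrape job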
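
Scrape timeouts are configured in `prometheus.yml` rather than on the command line. Here is a minimal sketch with a global default and a per-job override; the job name and target are hypothetical. `scrape_timeout` must not be greater than the `scrape_interval` it applies to.

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  scrape_timeout: 10s          # default value, shown for clarity

scrape_configs:
  - job_name: 'slow-exporter'  # hypothetical job with a slow endpoint
    scrape_interval: 1m
    scrape_timeout: 30s        # per-job override
    static_configs:
      - targets: ['slow:9100'] # hypothetical target
```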

6.2 Cache results locally

yaml

# Use recording rules to cache the result of an expensive expression
groups:
  - name: cache_rules
    interval: 30s
    rules:
      - record: cached:metric
        expr: expensive_query   # placeholder for the expensive expression

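7. Summary

Key optimization points:

Area	Method
Query optimization	Use recording rules, narrow the query range
Scrape optimization	Adjust intervals, filter metrics
Storage optimization	Use SSDs, adjust retention
Memory optimization	Adjust query concurrency, limit samples

Monitoring metrics:

What to watch	Metric
Query latency	prometheus_http_request_duration_seconds
Scrape latency	scrape_duration_seconds
Memory usage	process_resident_memory_bytes

Next, we will look at capacity planning.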