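Prometheus Metric Types

1. Metric Types Overview

1.1 The Four Metric Types

text

Prometheus metric types:

┌─────────────────────────────────────────────┐
│ 1. Counter                                  │
│    • Cumulative value that only increases   │
│    • Used for counting events               │
│    • Examples: request count, error count   │
├─────────────────────────────────────────────┤
│ 2. Gauge                                    │
│    • Current value that can go up or down   │
│    • Used for current state                 │
│    • Examples: temperature, memory usage    │
├─────────────────────────────────────────────┤
│ 3. Histogram                                │
│    • Distribution of observed values        │
│    • Quantiles estimated at query time      │
│    • Example: request latency distribution  │
├─────────────────────────────────────────────┤
│ 4. Summary                                  │
│    • Quantile statistics of observed values │
│    • Quantiles computed in the client       │
│    • Example: request latency quantiles     │
└─────────────────────────────────────────────┘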

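1.2 Choosing a Type

text

Choosing a metric type:

What do you need to record?
│
├── A cumulative total (only increases)
│   └── Use Counter
│       Examples: total requests, total errors
│
├── A current state (can go up or down)
│   └── Use Gauge
│       Examples: temperature, memory usage, queue length
│
├── A distribution (quantiles needed)
│   │
│   ├── Bucket-based estimates are acceptable
│   │   └── Use Histogram
│   │       Pros: aggregatable, configurable buckets
│   │
│   └── Tightly bounded quantile error is needed per instance
│       └── Use Summary
│           Pros: configurable quantile error
│           Cons: quantiles cannot be aggregated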

2. Counter

2.1 Concept

text

Counter properties:

┌─────────────────────────────────────────────┐
│ Definition                                  │
├─────────────────────────────────────────────┤
│ • Only increases; resets to 0 on restart    │
│ • Records a cumulative total                │
│ • Usually queried with rate()               │
└─────────────────────────────────────────────┘

Timeline example:

Value
│
│                              ┌───── 1000
│                         ┌────┘
│                    ┌────┘
│               ┌────┘
│          ┌────┘
│     ┌────┘
│ ────┘
└─────────────────────────────────> Time
    0    1    2    3    4    5

A Counter only increases; it represents a cumulative total.
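
Because a Counter drops back to zero when the process restarts, raw values are rarely compared directly. The following is a minimal sketch, in plain Python, of how a reset-aware increase can be computed from raw samples. It is a simplification: PromQL's rate() and increase() also extrapolate to the window boundaries, which is not modeled here, and the function name `counter_increase` is hypothetical.

```python
def counter_increase(samples):
    """Total increase across samples, treating any drop as a counter reset."""
    total = 0.0
    prev = None
    for value in samples:
        if prev is not None:
            if value >= prev:
                total += value - prev
            else:
                # The counter restarted from zero, so the whole new value
                # counts as increase since the reset.
                total += value
        prev = value
    return total


# The drop from 200 to 20 is interpreted as a restart, not a decrease.
samples = [100, 150, 200, 20, 70]
print(counter_increase(samples))  # 170.0
```

Dividing the result by the length of the window in seconds gives an average per-second rate.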

2.2 Use Cases

text

When to use a Counter:

┌─────────────────────────────────────────────┐
│ 1. Request counting                         │
│    http_requests_total                      │
│    • Total number of HTTP requests          │
│    • Use with rate() to compute QPS         │
├─────────────────────────────────────────────┤
│ 2. Error counting                           │
│    http_errors_total                        │
│    • Total number of errors                 │
│    • Used to compute the error ratio        │
├─────────────────────────────────────────────┤
│ 3. Completed tasks                          │
│    tasks_completed_total                    │
│    • Number of completed tasks              │
│    • Used to compute completion rate        │
├─────────────────────────────────────────────┤
│ 4. Data volume processed                    │
│    bytes_processed_total                    │
│    • Number of bytes processed              │
│    • Used to compute throughput             │
└─────────────────────────────────────────────┘

2.3 Common Queries

promql

# Requests per second (QPS), averaged over the last 5 minutes
rate(http_requests_total[5m])

# Increase in the request count over the last minute
increase(http_requests_total[1m])

# QPS grouped by status code
sum by (status) (rate(http_requests_total[5m]))

# Error ratio
sum(rate(http_errors_total[5m])) / sum(rate(http_requests_total[5m]))

# QPS grouped by service
sum by (service) (rate(http_requests_total[5m]))

# Increase in the request count over the past hour
increase(http_requests_total[1h])
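
To make the arithmetic behind the QPS and error-ratio queries concrete, here is a small sketch in plain Python. It assumes the counters did not reset inside the window and uses only the first and last readings; the function names and the sample numbers are illustrative, not taken from a real system.

```python
def per_second_rate(first, last, window_seconds):
    """Average per-second increase between two counter readings."""
    return (last - first) / window_seconds


def error_ratio(err_first, err_last, req_first, req_last, window_seconds):
    """Share of requests that failed, as error rate divided by request rate."""
    error_rate = per_second_rate(err_first, err_last, window_seconds)
    request_rate = per_second_rate(req_first, req_last, window_seconds)
    if request_rate == 0:
        return 0.0  # no traffic in the window, so report no errors
    return error_rate / request_rate


# Readings taken 5 minutes (300 seconds) apart.
qps = per_second_rate(12_000, 13_500, 300)
ratio = error_ratio(40, 55, 12_000, 13_500, 300)
print(qps)              # 5.0
print(round(ratio, 4))  # 0.01
```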

2.4 Code Examples

python

# Python client example
from prometheus_client import Counter

# Create a Counter with three labels
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Increment by 1
http_requests_total.labels(method='GET', endpoint='/api/users', status='200').inc()
http_requests_total.labels(method='POST', endpoint='/api/users', status='201').inc()

# Increment by a given amount (must not be negative)
http_requests_total.labels(method='GET', endpoint='/api/users', status='200').inc(5)

go

// Go client example
package main

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    httpRequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )
)

func main() {
    // Increment by 1
    httpRequestsTotal.WithLabelValues("GET", "/api/users", "200").Inc()
    httpRequestsTotal.WithLabelValues("POST", "/api/users", "201").Inc()

    // Increment by a given amount (must not be negative)
    httpRequestsTotal.WithLabelValues("GET", "/api/users", "200").Add(5)
}

3. Gauge

3.1 Concept

text

Gauge properties:

┌─────────────────────────────────────────────┐
│ Definition                                  │
├─────────────────────────────────────────────┤
│ • Instantaneous value; can go up or down    │
│ • Records the current state                 │
│ • Read directly; no rate() needed           │
└─────────────────────────────────────────────┘

Timeline example:

Value
│     ┌─────┐
│     │     │     ┌───┐
│ ────┘     └─────┘   └────
│
└─────────────────────────────────> Time
    0    1    2    3    4    5

A Gauge can go up or down; it represents the current state.

3.2 Use Cases

text

When to use a Gauge:

┌─────────────────────────────────────────────┐
│ 1. System resources                         │
│    node_memory_MemAvailable_bytes           │
│    • Memory currently available             │
│    node_load1                               │
│    • 1-minute load average                  │
├─────────────────────────────────────────────┤
│ 2. Application state                        │
│    process_open_fds                         │
│    • Open file descriptors                  │
│    process_resident_memory_bytes            │
│    • Resident memory of the process         │
├─────────────────────────────────────────────┤
│ 3. Business metrics                         │
│    queue_length                             │
│    • Queue length                           │
│    active_connections                       │
│    • Active connections                     │
├─────────────────────────────────────────────┤
│ 4. Environmental data                       │
│    room_temperature_celsius                 │
│    • Room temperature                       │
│    humidity_percent                         │
│    • Humidity percentage                    │
└─────────────────────────────────────────────┘

Note: node_cpu_seconds_total is a Counter, not a Gauge, so it is not listed here.

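3.3 Common Queries

promql

# Current value
node_memory_MemAvailable_bytes

# Memory usage as a percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Maximum across all series
max(node_memory_MemAvailable_bytes)

# Minimum across all series
min(node_memory_MemAvailable_bytes)

# Average across all series
avg(node_memory_MemAvailable_bytes)

# Sum across all series
sum(node_memory_MemAvailable_bytes)

# Change over the past hour
delta(node_memory_MemAvailable_bytes[1h])

# Per-second rate of change (linear regression over the past hour)
deriv(node_memory_MemAvailable_bytes[1h])

# Predicted value one hour from now (linear regression over the past hour)
predict_linear(node_memory_MemAvailable_bytes[1h], 3600)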
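
The idea behind delta() and predict_linear() can be sketched in plain Python as follows. This is an illustration under simplifying assumptions, not the PromQL implementation: it treats the last sample as "now" and does no boundary extrapolation, and the sample data is made up.

```python
def delta(samples):
    """Difference between the last and first value; samples are (timestamp, value)."""
    return samples[-1][1] - samples[0][1]


def predict_linear(samples, seconds_ahead):
    """Fit a least-squares line and evaluate it seconds_ahead after the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var                      # change in value per second
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept


# Available memory (in MB) sampled every 10 minutes, shrinking by 60 MB each time.
samples = [(0, 1000.0), (600, 940.0), (1200, 880.0), (1800, 820.0)]
print(delta(samples))                           # -180.0
print(round(predict_linear(samples, 3600), 2))  # 460.0
```

A negative prediction for a "bytes available" gauge is the usual signal used for "disk or memory will run out within N hours" alerts.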

3.4 Code Examples

python

# Python client example
from prometheus_client import Gauge

# Create Gauges, one with a label and one without
queue_length = Gauge(
    'queue_length',
    'Current queue length',
    ['queue_name']
)

temperature = Gauge(
    'room_temperature_celsius',
    'Current room temperature'
)

# Set an absolute value
queue_length.labels(queue_name='email').set(100)
temperature.set(25.5)

# Increment by 1
queue_length.labels(queue_name='email').inc()

# Decrement by 1
queue_length.labels(queue_name='email').dec()

# Increment by a given amount
queue_length.labels(queue_name='email').inc(10)

# Decrement by a given amount
queue_length.labels(queue_name='email').dec(5)

go

// Go client example
package main

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    queueLength = promauto.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "queue_length",
            Help: "Current queue length",
        },
        []string{"queue_name"},
    )

    temperature = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "room_temperature_celsius",
        Help: "Current room temperature",
    })
)

func main() {
    // Set an absolute value
    queueLength.WithLabelValues("email").Set(100)
    temperature.Set(25.5)

    // Increment by 1
    queueLength.WithLabelValues("email").Inc()

    // Decrement by 1
    queueLength.WithLabelValues("email").Dec()

    // Increment by a given amount
    queueLength.WithLabelValues("email").Add(10)

    // Decrement by a given amount
    queueLength.WithLabelValues("email").Sub(5)
}

4. Histogram

4.1 Concept

text

Histogram properties:

┌─────────────────────────────────────────────┐
│ Definition                                  │
├─────────────────────────────────────────────┤
│ • Samples observations into buckets         │
│ • Exposes multiple time series              │
│ • Quantiles computed on the server          │
│ • Aggregatable                              │
└─────────────────────────────────────────────┘

Time series exposed:

http_request_duration_seconds
├── http_request_duration_seconds_bucket{le="0.1"}    # requests that took ≤ 0.1s
├── http_request_duration_seconds_bucket{le="0.5"}    # requests that took ≤ 0.5s
├── http_request_duration_seconds_bucket{le="1.0"}    # requests that took ≤ 1.0s
├── http_request_duration_seconds_bucket{le="+Inf"}   # all requests
├── http_request_duration_seconds_sum                 # sum of all observed durations
└── http_request_duration_seconds_count               # total number of observations

Bucket structure:

Request latency distribution (the _bucket series report the cumulative count):
┌─────────────────────────────────────────────┐
│ Bucket      │ Count │ Cumulative count      │
├─────────────────────────────────────────────┤
│ ≤0.1s       │  100  │  100                  │
│ ≤0.5s       │  200  │  300                  │
│ ≤1.0s       │  150  │  450                  │
│ ≤2.0s       │  100  │  550                  │
│ +Inf        │   50  │  600                  │
└─────────────────────────────────────────────┘
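
The cumulative nature of the `le` buckets is easy to miss, so here is a minimal plain-Python sketch of how observations end up in them. It is only an illustration of the counting rule (a value is counted in every bucket whose upper bound is at least that value); the function name and the observations are hypothetical.

```python
import math


def cumulative_buckets(observations, upper_bounds):
    """Count observations per bucket the way the _bucket series report them."""
    bounds = list(upper_bounds) + [math.inf]   # the +Inf bucket is always present
    counts = {bound: 0 for bound in bounds}
    for value in observations:
        for bound in bounds:
            if value <= bound:
                counts[bound] += 1             # every bucket that covers the value
    return counts


observations = [0.05, 0.3, 0.3, 0.7, 1.5, 4.0]
buckets = cumulative_buckets(observations, [0.1, 0.5, 1.0, 2.0])
for bound, count in buckets.items():
    print(f'le="{bound}"', count)

# The _count series equals the +Inf bucket; _sum is the sum of all observations.
print("count:", len(observations), "sum:", sum(observations))
```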

4.2 Use Cases

text

When to use a Histogram:

┌─────────────────────────────────────────────┐
│ 1. Request latency                          │
│    http_request_duration_seconds            │
│    • Records request response time          │
│    • Compute P50, P90, P99 latency          │
├─────────────────────────────────────────────┤
│ 2. Response size                            │
│    http_response_size_bytes                 │
│    • Records response payload size          │
│    • Analyze the size distribution          │
├─────────────────────────────────────────────┤
│ 3. Processing time                          │
│    task_processing_duration_seconds         │
│    • Records task processing time           │
│    • Analyze the time distribution          │
├─────────────────────────────────────────────┤
│ 4. Batch size                               │
│    batch_size                               │
│    • Records batch sizes                    │
│    • Analyze the size distribution          │
└─────────────────────────────────────────────┘

4.3 Common Queries

promql

# P50 latency (0.5 quantile)
histogram_quantile(0.5, rate(http_request_duration_seconds_bucket[5m]))

# P90 latency (0.9 quantile)
histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[5m]))

# P99 latency (0.99 quantile)
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# P99 latency per service (the le label must be kept when aggregating)
histogram_quantile(0.99, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))

# Average latency
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

# Request rate
rate(http_request_duration_seconds_count[5m])

# Per-bucket observation rate (cumulative: each le bucket includes all smaller buckets)
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
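
To show why histogram quantiles are estimates, the sketch below reimplements the basic idea of histogram_quantile() for classic buckets in plain Python: find the bucket that contains the target rank, then interpolate linearly inside it. It is a simplified illustration (no handling of malformed or non-monotonic buckets), and the input reuses the counts from the bucket table in section 4.1.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs sorted by bound."""
    total = buckets[-1][1]                 # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0    # the first bucket is assumed to start at 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies beyond the last finite bound, which is
                # the best answer the buckets can give.
                return lower_bound
            if count == lower_count:
                return upper_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


buckets = [(0.1, 100), (0.5, 300), (1.0, 450), (2.0, 550), (math.inf, 600)]
print(round(histogram_quantile(0.5, buckets), 3))   # 0.5
print(round(histogram_quantile(0.9, buckets), 3))   # 1.9
print(round(histogram_quantile(0.99, buckets), 3))  # 2.0
```

The P99 result is capped at 2.0 because 50 requests fall in the +Inf bucket; adding a higher finite bucket (for example 5.0) is what would make that estimate useful.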

4.4 Code Examples

python

# Python client example
from prometheus_client import Histogram

# Create a Histogram with explicit bucket upper bounds (in seconds)
http_request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint'],
    buckets=(0.1, 0.5, 1.0, 2.0, 5.0, 10.0, float('inf'))
)

# Record an observation by timing a block
with http_request_duration.labels(method='GET', endpoint='/api/users').time():
    # Handle the request
    pass

# Record an observation manually (value in seconds)
http_request_duration.labels(method='GET', endpoint='/api/users').observe(0.25)

go

// Go client example
package main

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    httpRequestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration in seconds",
            Buckets: []float64{0.1, 0.5, 1.0, 2.0, 5.0, 10.0},
        },
        []string{"method", "endpoint"},
    )
)

func main() {
    // Use a Timer
    timer := prometheus.NewTimer(httpRequestDuration.WithLabelValues("GET", "/api/users"))
    // Handle the request
    timer.ObserveDuration()

    // Record an observation manually (value in seconds)
    httpRequestDuration.WithLabelValues("GET", "/api/users").Observe(0.25)

    // Measure with time.Since
    start := time.Now()
    // Handle the request
    duration := time.Since(start).Seconds()
    httpRequestDuration.WithLabelValues("GET", "/api/users").Observe(duration)
}

5. Summary

5.1 Concept

text

Summary properties:

┌─────────────────────────────────────────────┐
│ Definition                                  │
├─────────────────────────────────────────────┤
│ • Samples observations, tracks quantiles    │
│ • Quantiles computed in the client          │
│ • Error bound set per quantile              │
│ • Quantiles not aggregatable                │
└─────────────────────────────────────────────┘

Time series exposed:

http_request_duration_seconds
├── http_request_duration_seconds{quantile="0.5"}   # P50
├── http_request_duration_seconds{quantile="0.9"}   # P90
├── http_request_duration_seconds{quantile="0.99"}  # P99
├── http_request_duration_seconds_sum               # sum of all observed durations
└── http_request_duration_seconds_count             # total number of observations

Histogram vs Summary:

┌──────────────────┬─────────────────┬─────────────────┐
│ Property         │ Histogram       │ Summary         │
├──────────────────┼─────────────────┼─────────────────┤
│ Quantile calc.   │ Server (PromQL) │ Client library  │
│ Quantile error   │ Bucket-limited  │ Configurable    │
│ Aggregatable     │ Yes             │ No              │
│ Quantile choice  │ At query time   │ Preconfigured   │
│ Storage cost     │ Higher          │ Lower           │
│ Recommendation   │ Default choice  │ Special cases   │
└──────────────────┴─────────────────┴─────────────────┘
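
"Computed in the client" means the application itself keeps recent observations and reports ready-made quantile values. The toy class below sketches that idea in plain Python. It is deliberately simplified: it keeps a fixed number of recent samples and sorts them, whereas real client libraries typically use a time-based window and a streaming algorithm with a configurable error. The class name `WindowedSummary` and its parameters are hypothetical.

```python
import math
from collections import deque


class WindowedSummary:
    """Toy summary: quantiles over the most recent max_samples observations."""

    def __init__(self, max_samples=1000):
        self._window = deque(maxlen=max_samples)  # old samples fall out automatically
        self.count = 0      # corresponds to the _count series
        self.sum = 0.0      # corresponds to the _sum series

    def observe(self, value):
        self._window.append(value)
        self.count += 1
        self.sum += value

    def quantile(self, q):
        if not self._window:
            return math.nan
        ordered = sorted(self._window)
        # Nearest-rank method: smallest value with at least a share q at or below it.
        rank = max(1, math.ceil(q * len(ordered)))
        return ordered[rank - 1]


summary = WindowedSummary(max_samples=5)
for latency in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]:
    summary.observe(latency)

print(summary.quantile(0.5))   # 0.5 (window now holds 0.3 .. 0.7)
print(summary.quantile(0.99))  # 0.7
print(summary.count)           # 7
```

Because only the finished quantile values leave the process, the server has no way to recombine them across instances, which is the aggregation limitation listed in the table.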

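5.2 Use Cases

text

When to use a Summary:

┌─────────────────────────────────────────────┐
│ Good fit for Summary:                       │
├─────────────────────────────────────────────┤
│ • Tight quantile error bound needed         │
│ • No aggregation across instances           │
│ • Quantiles are fixed in the client         │
│ • Latency-sensitive scenarios               │
└─────────────────────────────────────────────┘

Avoid Summary when you need to:
• Aggregate quantiles across multiple instances
• Compute quantiles dynamically at query time
• Change which quantiles are reported without redeploying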

5.3 Common Queries

promql

# P50 latency
http_request_duration_seconds{quantile="0.5"}

# P90 latency
http_request_duration_seconds{quantile="0.9"}

# P99 latency
http_request_duration_seconds{quantile="0.99"}

# Average latency
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

# Request rate
rate(http_request_duration_seconds_count[5m])

# Note: Summary quantiles cannot be aggregated across instances.
# The following query is wrong:
# avg(http_request_duration_seconds{quantile="0.99"})  # statistically meaningless
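
A small numeric example shows why averaging per-instance quantiles is misleading. The sketch below uses plain Python with made-up latencies for two hypothetical instances: one busy and fast, one nearly idle and slow.

```python
import math


def nearest_rank_quantile(values, q):
    """Smallest value such that at least a share q of the data is at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


instance_a = [0.1] * 1000   # busy instance, fast requests
instance_b = [5.0] * 10     # nearly idle instance, slow requests

p99_a = nearest_rank_quantile(instance_a, 0.99)
p99_b = nearest_rank_quantile(instance_b, 0.99)

averaged = (p99_a + p99_b) / 2                                    # what avg() would report
true_p99 = nearest_rank_quantile(instance_a + instance_b, 0.99)  # P99 over all requests

print(round(averaged, 2))  # 2.55
print(true_p99)            # 0.1
```

The average ignores how many requests each instance served. Histogram buckets avoid this because bucket counts can be summed across instances before the quantile is estimated.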

5.4 Code Examples

python

# Python client example
from prometheus_client import Summary

# Create a Summary.
# Note: the Python client's Summary exposes only the _sum and _count series.
# It does not compute quantiles and has no quantile option, and registering
# the same metric name twice raises a ValueError, so it is defined once here.
http_request_duration = Summary(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint']
)

# Record an observation by timing a block
with http_request_duration.labels(method='GET', endpoint='/api/users').time():
    # Handle the request
    pass

# Record an observation manually (value in seconds)
http_request_duration.labels(method='GET', endpoint='/api/users').observe(0.25)

go

// Go client example
package main

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    httpRequestDuration = promauto.NewSummaryVec(
        prometheus.SummaryOpts{
            Name: "http_request_duration_seconds",
            Help: "HTTP request duration in seconds",
            // Quantile -> allowed absolute error for that quantile
            Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
        },
        []string{"method", "endpoint"},
    )
)

func main() {
    // Use a Timer
    timer := prometheus.NewTimer(httpRequestDuration.WithLabelValues("GET", "/api/users"))
    // Handle the request
    timer.ObserveDuration()

    // Record an observation manually (value in seconds)
    httpRequestDuration.WithLabelValues("GET", "/api/users").Observe(0.25)
}

6. Metric Type Comparison

6.1 Comparison Table

text

Metric type comparison:

┌──────────────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ Property     │ Counter     │ Gauge       │ Histogram   │ Summary     │
├──────────────┼─────────────┼─────────────┼─────────────┼─────────────┤
│ Value        │ Up only     │ Up or down  │ Observations│ Observations│
│ Main use     │ Cumulative  │ State       │ Distribution│ Quantiles   │
│ Typical use  │ Requests    │ Memory      │ Latency     │ Latency     │
│ Functions    │ rate()      │ Use as-is   │ histogram_  │ Query as-is │
│              │ increase()  │             │ quantile()  │             │
│ Aggregatable │ Yes         │ Yes         │ Yes         │ No          │
│ Storage cost │ Low         │ Low         │ High        │ Medium      │
│ Compute cost │ Low         │ Low         │ Server side │ Client side │
└──────────────┴─────────────┴─────────────┴─────────────┴─────────────┘
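
The storage-cost row can be made concrete by counting time series. The sketch below is a rough estimate in plain Python, assuming a classic histogram (one series per configured bucket, plus the +Inf bucket, `_sum`, and `_count`) and a summary with a fixed set of quantiles. Some clients expose extra series such as `_created`, which are not counted here; the label counts are hypothetical.

```python
def histogram_series(bucket_count, label_combinations):
    """Series for a classic histogram: buckets + the +Inf bucket + _sum + _count."""
    return (bucket_count + 1 + 2) * label_combinations


def summary_series(quantile_count, label_combinations):
    """Series for a summary: one per quantile + _sum + _count."""
    return (quantile_count + 2) * label_combinations


# Hypothetical service: 2 HTTP methods x 10 endpoints = 20 label combinations.
label_combinations = 2 * 10

print(histogram_series(6, label_combinations))  # 180 (6 buckets, as in the examples above)
print(summary_series(3, label_combinations))    # 100 (P50, P90, P99)
```

Every extra bucket or label value multiplies the series count, so bucket layouts and label sets are worth keeping small.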

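6.2 Recommendations

text

Recommendations for choosing a metric type:

1. Recording a quantity (only increases)
   └── Use Counter
       • Request count, error count, volume processed

2. Recording a state (can go up or down)
   └── Use Gauge
       • Memory, queue length, temperature

3. Recording a distribution (quantiles needed)
   │
   ├── Multiple instances that must be aggregated
   │   └── Use Histogram
   │       • Quantiles can be computed after aggregation
   │       • Computed on the server
   │
   └── Single instance with preconfigured quantiles
       └── Use Summary
           • Quantiles computed in the client
           • Quantiles cannot be aggregated

Recommendation: prefer Histogram by default.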

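7. Summary

Key points for each metric type:

Type	Characteristic	Use cases
Counter	Only increases	Request count, error count
Gauge	Can go up or down	Memory, temperature, queue length
Histogram	Distribution statistics	Latency distribution, response size
Summary	Quantile statistics	Preconfigured per-instance quantiles

Common queries:

Type	Typical query
Counter	`rate(metric[5m])`
Gauge	`metric` or `avg(metric)`
Histogram	`histogram_quantile(0.99, rate(metric_bucket[5m]))`
Summary	`metric{quantile="0.99"}`

Next, let's look at labels and label selectors.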