Federated Clusters

1. Federation Overview

1.1 What Is Federation

text

Federation in brief:

┌─────────────────────────────────────────────┐
│ Prometheus Federation                       │
├─────────────────────────────────────────────┤
│ • Multiple Prometheus instances cooperate   │
│ • One instance scrapes data from another    │
│ • Supports hierarchical topologies          │
│ • Provides a global view                    │
└─────────────────────────────────────────────┘

Federation architecture:

┌───────────────────────────────────────────────────────┐
│                   Global Prometheus                   │
│                    (Global Level)                     │
│                           │                           │
│         ┌─────────────────┼─────────────────┐         │
│         ▼                 ▼                 ▼         │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ Regional      │ │ Regional      │ │ Regional      │ │
│ │ Prometheus    │ │ Prometheus    │ │ Prometheus    │ │
│ │ (Region A)    │ │ (Region B)    │ │ (Region C)    │ │
│ │ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │ │
│ │ │ Instance 1│ │ │ │ Instance 1│ │ │ │ Instance 1│ │ │
│ │ │ Instance 2│ │ │ │ Instance 2│ │ │ │ Instance 2│ │ │
│ │ └───────────┘ │ │ └───────────┘ │ │ └───────────┘ │ │
│ └───────────────┘ └───────────────┘ └───────────────┘ │
└───────────────────────────────────────────────────────┘

1.2 Federation Use Cases

text

Federation use cases:

┌─────────────────────────────────────────────┐
│ 1. Multiple data centers                    │
├─────────────────────────────────────────────┤
│ • One Prometheus per data center            │
│ • A global Prometheus aggregates the data   │
│ • Enables cross-data-center monitoring      │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ 2. Multi-tenant environments                │
├─────────────────────────────────────────────┤
│ • One Prometheus per tenant                 │
│ • Global Prometheus aggregates key metrics  │
│ • Tenant isolation plus a global view       │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ 3. Hierarchical monitoring                  │
├─────────────────────────────────────────────┤
│ • Service-level Prometheus scrapes services │
│ • Cluster-level Prometheus aggregates them  │
│ • Global Prometheus aggregates clusters     │
└─────────────────────────────────────────────┘

2. Federation Configuration

2.1 Basic Federation Configuration

yaml

# Global Prometheus configuration

scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    
    static_configs:
      - targets:
          - 'prometheus-region-a:9090'
          - 'prometheus-region-b:9090'
        labels:
          federate: 'global'

2.2 Parameter Reference

text

Federation parameters:

┌─────────────────────────────────────────────┐
│ metrics_path                                │
├─────────────────────────────────────────────┤
│ • Federation endpoint: /federate            │
│ • Returns the series matching match[]       │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ match[]                                     │
├─────────────────────────────────────────────┤
│ • Selects which series to pull              │
│ • Takes an instant vector selector          │
│ • May be repeated; results are unioned      │
│ • Example: '{job="prometheus"}'             │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ honor_labels                                │
├─────────────────────────────────────────────┤
│ • true: keep labels from the scraped data   │
│ • false: rename conflicts to exported_*     │
│ • Federation normally sets this to true     │
└─────────────────────────────────────────────┘

2.3 Selective Federation

yaml

# Pull only specific series

scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Pull only aggregated (recording rule) series
        - '{__name__=~"job:.*"}'
        - '{__name__=~"namespace:.*"}'

        # Pull only a specific job
        - '{job="node-exporter"}'

        # Pull only specific metric names
        - '{__name__=~"up|scrape_duration_seconds"}'
    
    static_configs:
      - targets:
          - 'prometheus-region-a:9090'
          - 'prometheus-region-b:9090'
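
Regex matchers in Prometheus selectors are anchored at both ends, so a pattern such as up|scrape_duration_seconds matches whole metric names only. The Python sketch below approximates this with re.fullmatch for the name patterns used above. Python's regex engine is not the one Prometheus uses, so treat it as an approximation for simple patterns; the helper name_matches and the metric names namespace:cpu:sum and up_total are hypothetical.

```python
import re


def name_matches(pattern, metric_name):
    """Approximate an anchored __name__=~"pattern" matcher."""
    # fullmatch requires the pattern to cover the entire metric name.
    return re.fullmatch(pattern, metric_name) is not None


patterns = ["job:.*", "namespace:.*", "up|scrape_duration_seconds"]
metric_names = [
    "job:http_requests:rate5m",
    "namespace:cpu:sum",        # hypothetical recording rule name
    "up",
    "up_total",                 # hypothetical; not matched by "up"
    "http_requests_total",
    "scrape_duration_seconds",
]

selected = [
    name for name in metric_names
    if any(name_matches(pattern, name) for pattern in patterns)
]
print(selected)
```

Raw series such as http_requests_total are left out, which is the point of selective federation: the global server receives only the names it was asked for.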

3. Federation Best Practices

3.1 Use Recording Rules

yaml

# Recording rules on the regional Prometheus

groups:
  - name: aggregation_rules
    rules:
      # Pre-aggregated request rate
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Pre-aggregated 5xx error request rate
      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

      # Pre-aggregated p99 latency
      - record: job:http_request_duration:p99_5m
        expr: |
          histogram_quantile(0.99, 
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
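
The benefit of pre-aggregation is that the global server pulls one series per job instead of one per instance. The Python sketch below imitates what sum by (job) does to a handful of per-instance rates. The sample values, the job names api and web, and the helper sum_by_job are made up for illustration.

```python
from collections import defaultdict


def sum_by_job(samples):
    """Imitate `sum by (job)`: add up values that share a job label."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels["job"]] += value
    return dict(totals)


# Hypothetical per-instance request rates (requests per second).
per_instance_rates = [
    ({"job": "api", "instance": "app1:8080"}, 12.0),
    ({"job": "api", "instance": "app2:8080"}, 8.0),
    ({"job": "web", "instance": "web1:8080"}, 5.0),
]

aggregated = sum_by_job(per_instance_rates)
print(aggregated)
print(len(per_instance_rates), "series reduced to", len(aggregated))
```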

yaml

# Federation configuration on the global Prometheus

scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Pull only the pre-aggregated series
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'prometheus-region-a:9090'
          - 'prometheus-region-b:9090'

3.2 Add External Labels

yaml

# Regional Prometheus configuration

global:
  external_labels:
    region: 'region-a'
    datacenter: 'dc1'

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node1:9100', 'node2:9100']
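
External labels matter because two regions can expose series with identical names and labels. The Python sketch below shows how a per-region label keeps such series distinct once they reach the global server. It assumes that labels already present on a series take precedence over external labels; the helper names and the region-b and dc2 values are hypothetical.

```python
def with_external_labels(series, external_labels):
    """Attach external labels to a series without overriding its own labels."""
    merged = dict(external_labels)
    merged.update(series)  # labels already on the series win
    return merged


def series_identity(labels):
    """A series is identified by its full, order-independent label set."""
    return tuple(sorted(labels.items()))


# The same recording rule output exists in both regions.
series = {"__name__": "job:http_requests:rate5m", "job": "api"}

from_region_a = with_external_labels(
    series, {"region": "region-a", "datacenter": "dc1"}
)
from_region_b = with_external_labels(
    series, {"region": "region-b", "datacenter": "dc2"}  # hypothetical values
)

print(series_identity(from_region_a) == series_identity(from_region_b))
print(from_region_a["region"], from_region_b["region"])
```

Without the external labels both regions would produce the same label set, and the global server could not attribute a sample to its source.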

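3.3 Layered Federation

yaml

# Layer 1: service-level Prometheus (its own prometheus.yml)
# Scrapes only the service's own metrics

scrape_configs:
  - job_name: 'app'
    static_configs:
      - targets: ['app1:8080', 'app2:8080']

# Layer 2: cluster-level Prometheus (a separate prometheus.yml)
# Federates the service:* recording-rule series from the service level

scrape_configs:
  - job_name: 'federate-services'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"service:.*"}'
    static_configs:
      - targets:
          - 'prometheus-service-a:9090'
          - 'prometheus-service-b:9090'

# Layer 3: global Prometheus (a separate prometheus.yml)
# Federates the cluster:* recording-rule series from the cluster level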

scrape_configs:
  - job_name: 'federate-clusters'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"cluster:.*"}'
    static_configs:
      - targets:
          - 'prometheus-cluster-a:9090'
          - 'prometheus-cluster-b:9090'

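4. Federation API

4.1 The Federation Endpoint

bash

# Query the federation endpoint.
# -G sends the data as a GET query string; --data-urlencode escapes the
# brackets and braces that curl would otherwise treat as URL globbing.
curl -G 'http://localhost:9090/federate' \
  --data-urlencode 'match[]={job="prometheus"}'

# Multiple match[] parameters
curl -G 'http://localhost:9090/federate' \
  --data-urlencode 'match[]={job="prometheus"}' \
  --data-urlencode 'match[]={job="node-exporter"}'

# Regex match on the metric name
curl -G 'http://localhost:9090/federate' \
  --data-urlencode 'match[]={__name__=~"up|scrape_duration_seconds"}'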
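
The same requests can be assembled programmatically. The Python sketch below builds a /federate URL with one match[] parameter per selector and percent-encodes the brackets, braces and quotes. The function name build_federate_url is an assumption, and the base URL simply reuses the localhost address from the examples above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_federate_url(base_url, selectors):
    """Return a /federate URL with one match[] parameter per selector."""
    # A list of (key, value) pairs lets the same key appear several times.
    query = urlencode([("match[]", selector) for selector in selectors])
    return base_url.rstrip("/") + "/federate?" + query


selectors = ['{job="prometheus"}', '{job="node-exporter"}']
url = build_federate_url("http://localhost:9090", selectors)
print(url)

# Decoding the query string recovers the original selectors.
decoded = parse_qs(urlsplit(url).query)["match[]"]
print(decoded)
```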

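4.2 Response Format

text

# Illustrative /federate response in the Prometheus text exposition format.
# Federated metrics are exposed as untyped and without HELP text, and each
# sample carries a millisecond timestamp (the values below are placeholders).

# TYPE up untyped
up{instance="localhost:9090",job="prometheus"} 1 1700000000000
up{instance="localhost:9100",job="node-exporter"} 1 1700000000000

# TYPE scrape_duration_seconds untyped
scrape_duration_seconds{instance="localhost:9090",job="prometheus"} 0.005 1700000000000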
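
A response in this format is line-oriented, which makes it easy to inspect by hand. The Python sketch below is a deliberately small parser for sample lines of the form name{labels} value [timestamp]. It skips comment lines and does not unescape label values, so it is only an illustration and not a full exposition-format parser; parse_samples and the sample text are assumptions.

```python
import re

# metric_name{labels} value [timestamp]
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<timestamp>-?\d+))?$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return (name, labels, value, timestamp_ms or None) per sample line."""
    samples = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this sketch does not understand
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        timestamp = match.group("timestamp")
        samples.append((
            match.group("name"),
            labels,
            float(match.group("value")),
            int(timestamp) if timestamp is not None else None,
        ))
    return samples


sample_text = """\
# TYPE up untyped
up{instance="localhost:9090",job="prometheus"} 1 1700000000000
up{instance="localhost:9100",job="node-exporter"} 1
"""

parsed = parse_samples(sample_text)
for sample in parsed:
    print(sample)
```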

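5. Summary

Key federation settings:

Setting         Description
metrics_path    /federate
match[]         Selects the series to pull
honor_labels    Set to true to keep the source labels

Best practices:

Practice               Description
Recording rules        Pre-aggregate metrics before federating them
External labels        Identify where the data came from
Selective federation   Pull only the key series

Next, we will look at remote storage.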