High Availability Architecture

1. High Availability Overview

1.1 Why High Availability Is Needed

text

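Single point of failure risks:

┌─────────────────────────────────────────────┐
│ Single Prometheus instance                  │
├─────────────────────────────────────────────┤
│ • Hardware failure interrupts monitoring    │
│ • Network problems cause data loss          │
│ • No monitoring during maintenance          │
│ • Service interruption during upgrades      │
└─────────────────────────────────────────────┘

High availability goals:
├── Eliminate single points of failure
├── Data redundancy
├── Automatic failover
└── Seamless upgrades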

1.2 High Availability Strategies

text

High availability strategies:

┌─────────────────────────────────────────────┐
│ 1. Active-active deployment                 │
├─────────────────────────────────────────────┤
│ • Two Prometheus instances run concurrently │
│ • Each collects data independently          │
│ • Alerts deduplicated by Alertmanager       │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ 2. Sharded deployment                       │
├─────────────────────────────────────────────┤
│ • Multiple instances shard the scraping     │
│ • Each instance scrapes a subset of targets │
│ • Aggregated through federation             │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ 3. Remote storage                           │
├─────────────────────────────────────────────┤
│ • Data is written to remote storage         │
│ • Highly available storage layer            │
│ • Scalable query layer                      │
└─────────────────────────────────────────────┘
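As a minimal sketch of the third strategy, a Prometheus instance can forward its samples to a remote backend through `remote_write`. The endpoint URL below is a placeholder and not part of this guide; it should be replaced with the write endpoint of whichever backend is actually used, and the queue values are illustrative rather than tuned.

```yaml
# prometheus.yml (fragment)
remote_write:
  - url: 'http://remote-storage.example.internal/api/v1/write'  # hypothetical endpoint
    queue_config:
      capacity: 10000             # samples buffered per queue shard
      max_samples_per_send: 2000  # maximum samples per request
```

With this in place the local TSDB only needs to hold recent data, while durability and long-range queries become the responsibility of the remote storage layer.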

2. Active-Active Deployment

2.1 Architecture Design

text

Active-active architecture:

┌─────────────────────────────────────────────────────────┐
│                      Load Balancer                      │
│                            │                            │
│             ┌──────────────┴──────────────┐             │
│             ▼                             ▼             │
│    ┌─────────────────┐           ┌─────────────────┐    │
│    │ Prometheus A    │           │ Prometheus B    │    │
│    │                 │           │                 │    │
│    │ • All targets   │           │ • All targets   │    │
│    │ • Own storage   │           │ • Own storage   │    │
│    │ • Own alerting  │           │ • Own alerting  │    │
│    └────────┬────────┘           └────────┬────────┘    │
│             │                             │             │
│             └──────────────┬──────────────┘             │
│                            ▼                            │
│                 ┌─────────────────────┐                 │
│                 │   Alertmanager      │                 │
│                 │   (Cluster Mode)    │                 │
│                 │                     │                 │
│                 │ • Deduplication     │                 │
│                 │ • Grouping          │                 │
│                 └─────────────────────┘                 │
└─────────────────────────────────────────────────────────┘

2.2 Prometheus Configuration

yaml

# prometheus-a.yml

global:
  external_labels:
    prometheus: 'prometheus-a'

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'node1:9100'
          - 'node2:9100'
          - 'node3:9100'

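alerting:
  # Drop the per-replica external label before alerts are sent, so that both
  # replicas emit identical label sets and Alertmanager can deduplicate them
  alert_relabel_configs:
    - action: labeldrop
      regex: prometheus
  alertmanagers:
    - static_configs:
        - targets:
          - 'alertmanager1:9093'
          - 'alertmanager2:9093'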

yaml

# prometheus-b.yml

global:
  external_labels:
    prometheus: 'prometheus-b'

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'node1:9100'
          - 'node2:9100'
          - 'node3:9100'

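alerting:
  # Drop the per-replica external label before alerts are sent, so that both
  # replicas emit identical label sets and Alertmanager can deduplicate them
  alert_relabel_configs:
    - action: labeldrop
      regex: prometheus
  alertmanagers:
    - static_configs:
        - targets:
          - 'alertmanager1:9093'
          - 'alertmanager2:9093'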

2.3 Alertmanager Cluster

bash

# Start Alertmanager cluster node 1
alertmanager \
    --config.file=alertmanager.yml \
    --storage.path=/var/lib/alertmanager \
    --cluster.listen-address=0.0.0.0:9094 \
    --cluster.peer=alertmanager2:9094 \
    --web.listen-address=:9093

# Start Alertmanager cluster node 2
alertmanager \
    --config.file=alertmanager.yml \
    --storage.path=/var/lib/alertmanager \
    --cluster.listen-address=0.0.0.0:9094 \
    --cluster.peer=alertmanager1:9094 \
    --web.listen-address=:9093
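Both commands load an `alertmanager.yml` that this section does not show. A minimal sketch of such a file might look like the following; the receiver name and the webhook URL are placeholders chosen for illustration, and the timing values are examples rather than recommendations.

```yaml
# alertmanager.yml (the same file on both cluster nodes)
route:
  receiver: 'default'
  group_by: ['alertname', 'instance']  # alerts sharing these labels are batched together
  group_wait: 30s                      # wait before sending the first notification of a group
  group_interval: 5m                   # wait before notifying about new alerts in a group
  repeat_interval: 4h                  # resend interval for alerts that are still firing

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://alert-webhook.example.internal/notify'  # hypothetical endpoint
```

Keeping the file identical on every node matters because any node may end up sending a given notification.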

3. Sharded Deployment

3.1 Architecture Design

text

Sharded architecture:

┌─────────────────────────────────────────────────────────┐
│                    Global View Layer                    │
│                            │                            │
│          ┌─────────────────┼─────────────────┐          │
│          ▼                 ▼                 ▼          │
│   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   │
│   │ Prometheus  │   │ Prometheus  │   │ Prometheus  │   │
│   │ Shard 1     │   │ Shard 2     │   │ Shard 3     │   │
│   │             │   │             │   │             │   │
│   │ Targets:    │   │ Targets:    │   │ Targets:    │   │
│   │ node1-10    │   │ node11-20   │   │ node21-30   │   │
│   └─────────────┘   └─────────────┘   └─────────────┘   │
└─────────────────────────────────────────────────────────┘

3.2 Shard Configuration

yaml

# prometheus-shard1.yml

global:
  external_labels:
    shard: '1'

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'node1:9100'
          - 'node2:9100'
          - 'node3:9100'
          - 'node4:9100'
          - 'node5:9100'
          - 'node6:9100'
          - 'node7:9100'
          - 'node8:9100'
          - 'node9:9100'
          - 'node10:9100'
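Listing targets by hand works for a fixed fleet but has to be edited whenever nodes are added. One common alternative, sketched here, is to give every shard the full target list and let `hashmod` relabeling keep only one slice of it. The target file path is hypothetical, and the hash-based split will not reproduce the exact node1-10 / node11-20 / node21-30 ranges shown in the diagram.

```yaml
# prometheus-shard1.yml (alternative sketch: hash-based sharding)
global:
  external_labels:
    shard: '1'

scrape_configs:
  - job_name: 'node-exporter'
    file_sd_configs:
      - files:
          - 'targets/nodes.json'   # hypothetical file listing all 30 nodes
    relabel_configs:
      # Hash each target address into one of 3 buckets (0, 1 or 2)
      - source_labels: [__address__]
        modulus: 3
        target_label: __tmp_shard
        action: hashmod
      # This shard keeps bucket 0; the other shards would use '1' and '2'
      - source_labels: [__tmp_shard]
        regex: '0'
        action: keep
```

The only per-shard difference is then the `regex` value and the `shard` external label, which makes the configuration easier to template.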

3.3 Federation Aggregation

yaml

# prometheus-global.yml

scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'prometheus-shard1:9090'
          - 'prometheus-shard2:9090'
          - 'prometheus-shard3:9090'
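The `match[]` selector above only pulls series whose names start with `job:`, which by convention are the outputs of recording rules. The shards therefore need rules that produce such series, otherwise the global instance federates nothing. A minimal sketch, assuming node-exporter metrics and a hypothetical rule file name:

```yaml
# rules/aggregation.yml on each shard (hypothetical file name),
# referenced from the shard configuration via:
#   rule_files:
#     - 'rules/aggregation.yml'
groups:
  - name: node_aggregation
    rules:
      # Non-idle CPU usage rate, aggregated per job
      - record: job:node_cpu_seconds:rate5m
        expr: sum by (job) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
      # Number of healthy targets per job
      - record: job:up:sum
        expr: sum by (job) (up)
```

Federating only these pre-aggregated series keeps the global instance small; raw per-instance data stays on the shards.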

4. High Availability with Thanos

4.1 Architecture Design

text

Thanos high availability architecture:

┌─────────────────────────────────────────────────────────┐
│                      Thanos Query                       │
│     (query entry point, can run multiple instances)     │
│                            │                            │
│          ┌─────────────────┼─────────────────┐          │
│          ▼                 ▼                 ▼          │
│   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   │
│   │ Prometheus  │   │ Prometheus  │   │ Prometheus  │   │
│   │ + Sidecar A │   │ + Sidecar B │   │ + Sidecar C │   │
│   └──────┬──────┘   └──────┬──────┘   └──────┬──────┘   │
│          │                 │                 │          │
│          └─────────────────┼─────────────────┘          │
│                            ▼                            │
│                 ┌─────────────────────┐                 │
│                 │   Object Storage    │                 │
│                 │   • S3              │                 │
│                 │   • GCS             │                 │
│                 │   • MinIO           │                 │
│                 └─────────────────────┘                 │
└─────────────────────────────────────────────────────────┘

4.2 Deployment Configuration

yaml

# docker-compose.yml

services:
  prometheus-a:
    image: prom/prometheus:v2.48.0
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
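      # Overriding "command" drops the image defaults, so set the data path
      # explicitly to match the sidecar's --tsdb.path
      - '--storage.tsdb.path=/prometheus'
      # Thanos documentation recommends retention of at least 3x the block duration
      - '--storage.tsdb.retention.time=6h'
      # Equal min/max block durations disable local compaction for the sidecar
      - '--storage.tsdb.min-block-duration=2h'
      - '--storage.tsdb.max-block-duration=2h'
      # External labels have no command-line flag; set global.external_labels
      # (prometheus: 'prometheus-a') in the mounted configuration file
    volumes:
      - ./prometheus-a.yml:/etc/prometheus/prometheus.yml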
      - prometheus_a_data:/prometheus

  thanos-sidecar-a:
    image: thanosio/thanos:v0.32.0
    command:
      - 'sidecar'
      - '--tsdb.path=/prometheus'
      - '--prometheus.url=http://prometheus-a:9090'
      - '--objstore.config-file=/etc/thanos/bucket.yml'
    volumes:
      - prometheus_a_data:/prometheus
      - ./bucket.yml:/etc/thanos/bucket.yml

  prometheus-b:
    image: prom/prometheus:v2.48.0
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
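      # Overriding "command" drops the image defaults, so set the data path
      # explicitly to match the sidecar's --tsdb.path
      - '--storage.tsdb.path=/prometheus'
      # Thanos documentation recommends retention of at least 3x the block duration
      - '--storage.tsdb.retention.time=6h'
      # Equal min/max block durations disable local compaction for the sidecar
      - '--storage.tsdb.min-block-duration=2h'
      - '--storage.tsdb.max-block-duration=2h'
      # External labels have no command-line flag; set global.external_labels
      # (prometheus: 'prometheus-b') in the mounted configuration file
    volumes:
      - ./prometheus-b.yml:/etc/prometheus/prometheus.yml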
      - prometheus_b_data:/prometheus

  thanos-sidecar-b:
    image: thanosio/thanos:v0.32.0
    command:
      - 'sidecar'
      - '--tsdb.path=/prometheus'
      - '--prometheus.url=http://prometheus-b:9090'
      - '--objstore.config-file=/etc/thanos/bucket.yml'
    volumes:
      - prometheus_b_data:/prometheus
      - ./bucket.yml:/etc/thanos/bucket.yml

  thanos-query:
    image: thanosio/thanos:v0.32.0
    command:
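      - 'query'
      # Treat the "prometheus" external label as the replica label so that
      # samples from the two replicas are deduplicated at query time
      - '--query.replica-label=prometheus'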
      - '--store=thanos-sidecar-a:10901'
      - '--store=thanos-sidecar-b:10901'
      - '--store=thanos-store:10901'
    ports:
      - "19192:10902"

  thanos-store:
    image: thanosio/thanos:v0.32.0
    command:
      - 'store'
      - '--data-dir=/data'
      - '--objstore.config-file=/etc/thanos/bucket.yml'
    volumes:
      - thanos_store_data:/data
      - ./bucket.yml:/etc/thanos/bucket.yml

volumes:
  prometheus_a_data:
  prometheus_b_data:
  thanos_store_data:
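Every Thanos component above mounts a `bucket.yml` that is not shown. A minimal sketch for an S3-compatible store such as MinIO might look like this; the bucket name, endpoint, and credentials are placeholders, not values from this guide.

```yaml
# bucket.yml
type: S3
config:
  bucket: 'thanos-metrics'   # hypothetical bucket name
  endpoint: 'minio:9000'     # hypothetical MinIO service, host:port without a scheme
  access_key: 'REPLACE_ME'
  secret_key: 'REPLACE_ME'
  insecure: true             # plain HTTP; set to false when the endpoint uses TLS
```

The sidecars and the store gateway must all point at the same bucket, since the store gateway serves exactly the blocks that the sidecars upload.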

5. Failover

5.1 Load Balancer Configuration

nginx

# nginx.conf

upstream prometheus {
    server prometheus-a:9090;
    server prometheus-b:9090;
}

server {
    listen 80;
    
    location / {
        proxy_pass http://prometheus;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

5.2 Health Checks

bash

# Prometheus health checks

# Liveness check over HTTP
curl http://localhost:9090/-/healthy

# Readiness check
curl http://localhost:9090/-/ready

yaml

# Kubernetes probe configuration
livenessProbe:
  httpGet:
    path: /-/healthy
    port: 9090
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /-/ready
    port: 9090
  initialDelaySeconds: 5
  periodSeconds: 5
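The probes above tell the orchestrator about an unhealthy instance, but they do not notify anyone. One way to close that gap, sketched here under the assumption that the two replicas can reach each other as `prometheus-a:9090` and `prometheus-b:9090`, is to have each replica scrape its peer and alert when the peer disappears. The job name, rule file name, and alert name are illustrative.

```yaml
# prometheus-a.yml (fragment): scrape the peer replica
scrape_configs:
  - job_name: 'prometheus-peer'
    static_configs:
      - targets:
          - 'prometheus-b:9090'
---
# rules/meta.yml (hypothetical file, loaded through rule_files)
groups:
  - name: meta_monitoring
    rules:
      - alert: PrometheusPeerDown
        expr: up{job="prometheus-peer"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'Peer Prometheus {{ $labels.instance }} is unreachable'
```

Replica B would carry the mirrored configuration with `prometheus-a:9090` as its target, so that the loss of either instance is reported by the survivor.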

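6. Summary

High availability strategies:

Strategy	Description	Best suited for
Active-active deployment	Two instances run concurrently	Small to medium scale
Sharded deployment	Multiple instances each scrape a subset of targets	Large scale
Thanos	Remote storage plus global query	Long-term storage

Key components:

Component	Role
Alertmanager cluster	Alert deduplication
Load balancer	Traffic distribution
Health checks	Failure detection

Next, let's move on to operations and maintenance.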