Elasticsearch Monitoring and Alerting

1. Monitoring Overview

1.1 Monitoring Layers

text

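Monitoring layers
├── Cluster monitoring
│   ├── Cluster status
│   ├── Node status
│   └── Shard status
├── Node monitoring
│   ├── Resource usage
│   ├── JVM status
│   └── Thread pools
├── Index monitoring
│   ├── Document count
│   ├── Storage size
│   └── Operation statistics
└── Query monitoring
    ├── Query latency
    ├── Query throughput
    └── Slow queries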
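
To make the layers concrete, the sketch below maps each layer to Elasticsearch REST endpoints that expose the corresponding data. The mapping and the names `MONITORING_ENDPOINTS` and `endpoints_for` are illustrative assumptions, not part of the original guide. Slow queries are normally read from the search slow log rather than from a stats endpoint, so the query layer here only covers latency and throughput counters.

```python
# Hypothetical mapping from monitoring layer to REST endpoints worth polling.
MONITORING_ENDPOINTS = {
    "cluster": ["/_cluster/health", "/_cat/shards?format=json"],
    "node": ["/_nodes/stats/os,jvm,fs,thread_pool"],
    "index": ["/_stats/docs,store,indexing,search"],
    "query": ["/_nodes/stats/indices/search"],
}


def endpoints_for(layers):
    """Return the endpoints to poll for the given layers, without duplicates."""
    seen = []
    for layer in layers:
        for path in MONITORING_ENDPOINTS[layer]:
            if path not in seen:
                seen.append(path)
    return seen
```

Calling `endpoints_for(["cluster", "node"])` yields the three paths a collector would need for the first two layers.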

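1.2 Monitoring Tools

Tool	Description
Kibana Stack Monitoring	Official monitoring solution from Elastic
Prometheus + Grafana	Open-source monitoring stack
Elasticsearch Exporter	Prometheus exporter for Elasticsearch metrics
Elastic Agent	Unified data collection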

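2. Monitoring with Kibana

2.1 Enabling Monitoring

yaml

# elasticsearch.yml
# Turn on collection of monitoring data.
xpack.monitoring.collection.enabled: true
# xpack.monitoring.enabled was deprecated in 7.x and removed in 8.0,
# so set it only on older clusters that still accept it.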

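2.2 Configuring Kibana

yaml

# kibana.yml
# For clusters running in containers: show CPU usage based on cgroup statistics.
monitoring.ui.container.elasticsearch.enabled: true
elasticsearch.hosts: ["http://localhost:9200"]
elasticsearch.username: "kibana_system"
# Placeholder; use the real kibana_system password.
elasticsearch.password: "password"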

2.3 Accessing Monitoring

Open Kibana, then go to Management > Stack Monitoring in the main menu.

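2.4 Monitoring Metrics

text

Kibana monitoring metrics
├── Cluster overview
│   ├── Status
│   ├── Node count
│   └── Shard count
├── Node metrics
│   ├── CPU usage
│   ├── Memory usage
│   ├── Disk usage
│   └── JVM heap usage
├── Index metrics
│   ├── Indexing rate
│   ├── Search rate
│   └── Document count
└── Performance metrics
    ├── Query latency
    └── Indexing latency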

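3. Monitoring with Prometheus

3.1 Installing Elasticsearch Exporter

bash

# Inside the container, localhost is the container itself, so ES_URI must be
# an address reachable from the container if Elasticsearch runs elsewhere.
docker run -d \
  --name elasticsearch_exporter \
  -p 9114:9114 \
  -e ES_URI=http://localhost:9200 \
  quay.io/prometheuscommunity/elasticsearch-exporter:latest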

3.2 Prometheus Configuration

yaml

scrape_configs:
  - job_name: 'elasticsearch'
    static_configs:
      - targets: ['localhost:9114']
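
Before wiring up dashboards, it can help to confirm what the scrape target actually returns. The sketch below parses a small sample of the Prometheus text exposition format and extracts the current cluster color. The sample values, the `demo` cluster label, and the helper names `parse_metrics` and `cluster_color` are assumptions made for illustration. The parser is deliberately simplified and ignores optional timestamps and escaped label values.

```python
import re

# Illustrative sample of what the exporter might serve at /metrics.
SAMPLE = '''\
# TYPE elasticsearch_cluster_health_status gauge
elasticsearch_cluster_health_status{cluster="demo",color="green"} 0
elasticsearch_cluster_health_status{cluster="demo",color="red"} 0
elasticsearch_cluster_health_status{cluster="demo",color="yellow"} 1
elasticsearch_cluster_health_number_of_nodes{cluster="demo"} 3
'''

LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Parse exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines this simplified parser cannot handle
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def cluster_color(samples):
    """Return the color whose health-status series currently has value 1."""
    for name, labels, value in samples:
        if name == "elasticsearch_cluster_health_status" and value == 1:
            return labels.get("color")
    return None
```

In practice the text would come from an HTTP GET against the exporter's `/metrics` path on port 9114 rather than from a constant.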

3.3 Key Metrics

Metric	Description
elasticsearch_cluster_health_status	Cluster status (one series per color; the current color has value 1)
elasticsearch_cluster_health_number_of_nodes	Number of nodes
elasticsearch_cluster_health_number_of_data_nodes	Number of data nodes
elasticsearch_jvm_memory_used_bytes	JVM memory in use
elasticsearch_process_cpu_percent	CPU usage (percent)
elasticsearch_filesystem_data_size_bytes	Total disk size
elasticsearch_filesystem_data_available_bytes	Available disk space

3.4 Grafana Dashboard

Import an Elasticsearch dashboard by ID: 2322 or 4358.

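4. Alert Configuration

4.1 Kibana Alerts

Go to Stack Management > Alerts and Insights > Rules.

Cluster status alert (a simplified sketch of an Elasticsearch query rule; the exact parameter schema and the status field name depend on the Kibana version and on how monitoring data is collected):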

json

{
  "rule_type_id": ".es-query",
  "name": "Cluster Health Alert",
  "params": {
    "index": ".monitoring-es-*",
    "query": {
      "bool": {
        "filter": [
          { "term": { "cluster_stats.status": "red" } }
        ]
      }
    }
  }
}

4.2 Prometheus Alerting Rules

yaml

groups:
  - name: elasticsearch
    rules:
      - alert: ElasticsearchClusterRed
        expr: elasticsearch_cluster_health_status{color="red"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Elasticsearch cluster is red"
          
      - alert: ElasticsearchClusterYellow
        expr: elasticsearch_cluster_health_status{color="yellow"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Elasticsearch cluster is yellow"
          
      - alert: ElasticsearchHeapTooHigh
        expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Elasticsearch heap usage is high"
          
      - alert: ElasticsearchDiskSpaceLow
        expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes < 0.15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Elasticsearch disk space is low"

4.3 Elasticsearch Watcher

bash

PUT /_watcher/watch/cluster_health
{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "http": {
      "request": {
        "host": "localhost",
        "port": 9200,
        "path": "/_cluster/health"
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.status": {
        "not_eq": "green"
      }
    }
  },
  "actions": {
    "email_admin": {
      "email": {
        "to": "admin@example.com",
        "subject": "Elasticsearch Cluster Alert",
        "body": "Cluster status is {{ctx.payload.status}}"
      }
    }
  }
}
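
Where Watcher is not available, the same check can be approximated with a small external poller. The sketch below mirrors the watch's condition and message. The function names and the default `base_url` are assumptions, and delivery (email, chat, paging) is left out on purpose.

```python
import json
import urllib.request


def should_alert(health):
    """Mirror the watch condition: alert when status is anything but green."""
    return health.get("status") != "green"


def render_body(health):
    """Mirror the watch's email body template."""
    return "Cluster status is {}".format(health.get("status"))


def check_cluster(base_url="http://localhost:9200", timeout=5):
    """Fetch /_cluster/health and return an alert message, or None if healthy."""
    url = base_url + "/_cluster/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        health = json.load(resp)
    return render_body(health) if should_alert(health) else None
```

Running `check_cluster()` once a minute from a scheduler would approximate the watch's `interval` trigger.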

5. Key Monitoring Metrics

5.1 Cluster Metrics

Metric	Alert threshold
Cluster status	!= green
Node count	< expected count
Unassigned shards	> 0
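
The three cluster thresholds above can be evaluated directly against a `/_cluster/health` response. This is a minimal sketch: `cluster_alerts` and `expected_nodes` are hypothetical names, and the field names `status`, `number_of_nodes`, and `unassigned_shards` are those returned by the cluster health API.

```python
def cluster_alerts(health, expected_nodes):
    """Return human-readable alerts for a /_cluster/health response."""
    alerts = []
    if health["status"] != "green":
        alerts.append("cluster status is {}".format(health["status"]))
    if health["number_of_nodes"] < expected_nodes:
        alerts.append(
            "{} of {} expected nodes present".format(
                health["number_of_nodes"], expected_nodes
            )
        )
    if health["unassigned_shards"] > 0:
        alerts.append("{} unassigned shards".format(health["unassigned_shards"]))
    return alerts
```

An empty list means every cluster-level threshold is satisfied.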

5.2 Node Metrics

Metric	Alert threshold
CPU usage	> 80%
Heap usage	> 85%
Disk usage	> 85%
Old-generation GC frequency	> 5 per minute
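
As a rough illustration, the node thresholds above can be checked against two consecutive samples of one node's entry from `GET /_nodes/stats`. Two samples are needed because the GC counter is cumulative. The function names and the short alert labels are assumptions; the field paths follow the node stats response.

```python
def disk_used_percent(fs_total):
    """Percent of disk in use, from the fs.total section of node stats."""
    return 100.0 * (1 - fs_total["available_in_bytes"] / fs_total["total_in_bytes"])


def old_gc_per_minute(prev, curr, interval_seconds):
    """Old-generation collections per minute between two samples."""
    before = prev["jvm"]["gc"]["collectors"]["old"]["collection_count"]
    after = curr["jvm"]["gc"]["collectors"]["old"]["collection_count"]
    return (after - before) * 60.0 / interval_seconds


def node_alerts(prev, curr, interval_seconds):
    """Return the labels of all node thresholds that are exceeded."""
    alerts = []
    if curr["os"]["cpu"]["percent"] > 80:
        alerts.append("cpu")
    if curr["jvm"]["mem"]["heap_used_percent"] > 85:
        alerts.append("heap")
    if disk_used_percent(curr["fs"]["total"]) > 85:
        alerts.append("disk")
    if old_gc_per_minute(prev, curr, interval_seconds) > 5:
        alerts.append("old_gc")
    return alerts
```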

5.3 Index Metrics

Metric	Alert threshold
Indexing rate	Abnormal drop
Search latency	> 1 second
Rejected requests	> 0
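
The stats APIs report these values as cumulative counters, so rates and latencies have to be derived from the difference between two samples. The sketch below assumes `prev` and `curr` are the `indices` sections of a node stats response and that the pool arguments are its `thread_pool` sections; all function names are hypothetical.

```python
def indexing_rate(prev, curr, interval_seconds):
    """Documents indexed per second between two samples."""
    delta = curr["indexing"]["index_total"] - prev["indexing"]["index_total"]
    return delta / interval_seconds


def avg_search_latency_ms(prev, curr):
    """Average query time in milliseconds for queries run between two samples."""
    queries = curr["search"]["query_total"] - prev["search"]["query_total"]
    if queries <= 0:
        return 0.0  # no queries in the window, so no latency to report
    spent = curr["search"]["query_time_in_millis"] - prev["search"]["query_time_in_millis"]
    return spent / queries


def new_rejections(prev_pools, curr_pools):
    """Return thread pools whose rejected counter grew, with the increase."""
    increases = {}
    for name, pool in curr_pools.items():
        delta = pool["rejected"] - prev_pools.get(name, {}).get("rejected", 0)
        if delta > 0:
            increases[name] = delta
    return increases
```

Alerting on the increase in `rejected` rather than on its absolute value avoids an alert that never clears, since the counter only resets when the node restarts.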

5.4 JVM Metrics

Metric	Alert threshold
Heap usage	> 85%
GC time ratio	> 10%
Thread count	> 800
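
The GC time ratio is the share of wall-clock time spent in garbage collection over a sampling window. A minimal sketch, assuming `prev_jvm` and `curr_jvm` are the `jvm` sections of two node stats samples taken `interval_millis` apart (the function names and alert labels are illustrative):

```python
def gc_time_ratio(prev_collectors, curr_collectors, interval_millis):
    """Fraction of the interval spent in GC, summed over all collectors."""
    spent = 0
    for name, stats in curr_collectors.items():
        before = prev_collectors.get(name, {}).get("collection_time_in_millis", 0)
        spent += stats["collection_time_in_millis"] - before
    return spent / interval_millis


def jvm_alerts(prev_jvm, curr_jvm, interval_millis):
    """Return the labels of all JVM thresholds that are exceeded."""
    alerts = []
    if curr_jvm["mem"]["heap_used_percent"] > 85:
        alerts.append("heap")
    ratio = gc_time_ratio(
        prev_jvm["gc"]["collectors"], curr_jvm["gc"]["collectors"], interval_millis
    )
    if ratio > 0.10:
        alerts.append("gc_time")
    if curr_jvm["threads"]["count"] > 800:
        alerts.append("threads")
    return alerts
```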

6. Log Monitoring

6.1 Log Configuration

yaml

path.logs: /var/log/elasticsearch

logger.level: info

6.2 Log Collection

Use Filebeat to collect the logs:

yaml

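filebeat.inputs:
  # The log input is deprecated in recent Filebeat releases in favor of filestream.
  - type: log
    paths:
      - /var/log/elasticsearch/*.log
    fields:
      type: elasticsearch

# A custom index name requires a matching template name and pattern,
# and is ignored while ILM is enabled.
setup.ilm.enabled: false
setup.template.name: "elasticsearch-logs"
setup.template.pattern: "elasticsearch-logs-*"

output.elasticsearch:
  hosts: ["localhost:9200"]
  index: "elasticsearch-logs-%{+yyyy.MM.dd}"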

6.3 Log Analysis

Create a log analysis dashboard in Kibana.

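7. Monitoring Best Practices

7.1 Monitoring Strategy

text

Monitoring strategy
├── Layered monitoring
│   ├── Cluster layer
│   ├── Node layer
│   └── Application layer
├── Alert severity levels
│   ├── Critical: handle immediately
│   ├── Warning: investigate and follow up
│   └── Info: record and observe
└── Regular review
    └── Review alert rules periodically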

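7.2 Alerting Best Practices

text

Alerting best practices
├── Avoid alert storms
│   └── Set sensible thresholds and durations
├── Make alerts actionable
│   └── Include remediation guidance
├── Grade alerts by severity
│   └── Handle each level differently
└── Test regularly
    └── Confirm that alerts still fire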
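
One common way to avoid alert storms is to require that a condition hold for several consecutive checks before notifying, and to notify only once per incident. This is the same idea as the `for:` clause in Prometheus rules. The class below is a minimal sketch of that idea; its name and interface are assumptions, not an existing library API.

```python
class SustainedCondition:
    """Fire once when a condition has held for N consecutive checks."""

    def __init__(self, required_checks):
        self.required_checks = required_checks
        self.streak = 0      # consecutive breaching checks seen so far
        self.firing = False  # whether a notification was already sent

    def observe(self, breached):
        """Record one check; return True only when the alert starts firing."""
        if not breached:
            # Condition cleared: reset so a later breach can fire again.
            self.streak = 0
            self.firing = False
            return False
        self.streak += 1
        if self.streak >= self.required_checks and not self.firing:
            self.firing = True
            return True
        return False
```

With `required_checks=3` and one check per minute, a brief spike that clears within two minutes never notifies, and a sustained breach notifies exactly once until it clears.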

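7.3 Monitoring Dashboards

text

Dashboard recommendations
├── Cluster overview
│   ├── Status, node count, shard count
│   └── Storage usage, query throughput
├── Node details
│   ├── CPU, memory, disk
│   └── JVM, thread pools
├── Index details
│   ├── Document count, storage size
│   └── Operation statistics
└── Performance trends
    ├── Query latency trend
    └── Indexing rate trend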

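8. Troubleshooting

8.1 Common Problems

Problem	Troubleshooting steps
Cluster status is red	Check unassigned shards and node status
Slow queries	Check the query itself and the index design
High memory usage	Check the fielddata cache and running queries
Disk full	Delete old data or expand storage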

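8.2 Troubleshooting Tools

bash

# Overall cluster status and shard counts
GET /_cluster/health

# Why a shard is unassigned
GET /_cluster/allocation/explain

# Threads consuming the most CPU on each node
GET /_nodes/hot_threads

# Cluster-level tasks waiting to be executed
GET /_cat/pending_tasks?v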

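9. Summary

This chapter covered monitoring and alerting for Elasticsearch:

Kibana provides the official monitoring solution
Prometheus + Grafana is a popular alternative
Key metrics deserve close attention
Configure alert rules with sensible thresholds
Layered monitoring gives full coverage
Review and tune the monitoring strategy regularly

This concludes the Complete Guide to Elasticsearch.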