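Monitoring and Operations #

This chapter covers monitoring and operating Weaviate: health checks, Prometheus metrics, Grafana dashboards, logging, performance measurement, troubleshooting, routine maintenance, and alerting.

Monitoring Architecture #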

text

Weaviate monitoring architecture:

┌─────────────────────────────────────────────────────────────┐
│                      Monitoring Stack                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                        Grafana                        │  │
│  │              (visualization dashboards)               │  │
│  └───────────────────────────────────────────────────────┘  │
│                              │                              │
│                              ▼                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                      Prometheus                       │  │
│  │           (metrics collection and storage)            │  │
│  └───────────────────────────────────────────────────────┘  │
│                              │                              │
│                              ▼                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                       Weaviate                        │  │
│  │               (:2112/metrics endpoint)                │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                        Logging                        │  │
│  │                  ELK / Loki / stdout                  │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘

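Health Checks #

HTTP Health Check Endpoints #

bash

# Readiness: returns HTTP 200 once the instance can serve requests
curl http://localhost:8080/v1/.well-known/ready

# Liveness: returns HTTP 200 while the process is up
curl http://localhost:8080/v1/.well-known/live

# OIDC discovery (not a health probe; only answers when OIDC authentication is configured)
curl http://localhost:8080/v1/.well-known/openid-configuration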
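
Deployment scripts often need to block until the readiness endpoint answers with HTTP 200. The following is a minimal sketch using only the standard library. The names `http_probe` and `wait_until_ready`, the injectable `probe` and `sleep` parameters, and the retry defaults are illustrative assumptions, not part of Weaviate.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

READY_URL = "http://localhost:8080/v1/.well-known/ready"


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status
        return False


def wait_until_ready(
    url: str = READY_URL,
    attempts: int = 30,
    delay: float = 1.0,
    probe: Callable[[str], bool] = http_probe,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until `probe` succeeds or `attempts` is exhausted."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)  # wait before the next attempt
    return False
```

Calling `wait_until_ready()` with the defaults polls the local instance for up to roughly 30 seconds. Passing a custom `probe` makes the loop testable without a running server.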

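Python Health Check #

python

import weaviate

# Connect to a local instance on the default ports
client = weaviate.connect_to_local()

# Readiness and liveness as seen by the client
print(f"Ready: {client.is_ready()}")
print(f"Live: {client.is_live()}")

# Server metadata: version, hostname, enabled modules
meta = client.get_meta()
print(f"Version: {meta['version']}")
print(f"Hostname: {meta['hostname']}")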

Cluster Health Check #

python

# Query the status of every node in the cluster
nodes = client.cluster.nodes()

for node in nodes:
    print(f"Node: {node.name}")
    print(f"  Status: {node.status}")
    print(f"  Version: {node.version}")
    print(f"  Git hash: {node.git_hash}")
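
For scripts and alerts, a compact summary of the node list is often more useful than a per-node printout. The sketch below assumes each node object exposes `name` and `status` attributes and that a healthy node reports the string `HEALTHY`. The helper name `summarize_nodes` and the stand-in data are hypothetical.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Iterable


def summarize_nodes(nodes: Iterable) -> dict:
    """Count nodes per status and list those not reporting HEALTHY."""
    nodes = list(nodes)
    by_status = Counter(str(n.status) for n in nodes)
    unhealthy = [n.name for n in nodes if str(n.status) != "HEALTHY"]
    return {
        "total": len(nodes),
        "by_status": dict(by_status),
        "unhealthy": unhealthy,
    }


# Stand-in data; with a live cluster, pass client.cluster.nodes() instead.
sample = [
    SimpleNamespace(name="node-0", status="HEALTHY"),
    SimpleNamespace(name="node-1", status="UNHEALTHY"),
]
report = summarize_nodes(sample)
print(report)
```

A non-empty `unhealthy` list is a natural trigger for a non-zero exit code in a cron job or CI check.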

Prometheus Metrics #

Enabling Prometheus Metrics #

yaml

environment:
  PROMETHEUS_MONITORING_ENABLED: 'true'
  PROMETHEUS_MONITORING_PORT: '2112'

Metrics Endpoint #

bash

curl http://localhost:2112/metrics

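Key Metrics #

The metric names below are illustrative groupings. Actual names depend on the Weaviate version (documented names include queries_durations_ms, batch_durations_ms, and object_count, and the Go runtime exports go_goroutines and go_gc_duration_seconds), so confirm them against the output of the metrics endpoint before building dashboards or alerts.

text

Key Weaviate metrics:

Query performance:
├── weaviate_query_duration_seconds
├── weaviate_query_count_total
└── weaviate_query_errors_total

Import performance:
├── weaviate_import_duration_seconds
├── weaviate_import_count_total
└── weaviate_import_errors_total

Storage:
├── weaviate_object_count
├── weaviate_shard_count
└── weaviate_index_size_bytes

Resources:
├── weaviate_memory_usage_bytes
├── weaviate_goroutine_count
└── weaviate_gc_duration_seconds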
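
To see which metrics an instance actually exposes, the text returned by the metrics endpoint can be parsed directly. This is a simplified sketch of the Prometheus text format: it assumes samples carry no trailing timestamp, and the sample payload, including the label names `class_name` and `shard_name`, is made up for illustration. `parse_metrics` and `total` are hypothetical helper names.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse Prometheus text-format lines into {series: value}."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Assumption: no timestamp, so the value is the last token
        series, _, value = line.rpartition(" ")
        try:
            samples[series] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return samples


def total(samples: Dict[str, float], metric_name: str) -> float:
    """Sum one metric across all of its label sets."""
    return sum(
        value
        for series, value in samples.items()
        if series == metric_name or series.startswith(metric_name + "{")
    )


SAMPLE = """\
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 42
object_count{class_name="Article",shard_name="a1"} 1200
object_count{class_name="Article",shard_name="b2"} 800
"""

samples = parse_metrics(SAMPLE)
print(sorted({name.split("{")[0] for name in samples}))
print(total(samples, "object_count"))
```

Against a live instance, the input would be the body returned by `http://localhost:2112/metrics`.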

Prometheus Configuration #

yaml

scrape_configs:
  - job_name: 'weaviate'
    static_configs:
      - targets: ['weaviate:2112']
    metrics_path: '/metrics'
    scrape_interval: 15s

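Grafana Dashboards #

Importing a Dashboard #

Weaviate publishes example Grafana dashboards as JSON files in its source repository (the weaviate/weaviate repository on GitHub, under tools/dev/grafana/dashboards). To use one:

1. Open Grafana
2. Go to Dashboards > Import
3. Upload or paste the dashboard JSON and select the Prometheus data source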

Custom Dashboard #

json

{
  "dashboard": {
    "title": "Weaviate Monitoring",
    "panels": [
      {
        "title": "Query Latency",
        "type": "timeseries",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, rate(weaviate_query_duration_seconds_bucket[5m]))",
            "legendFormat": "P99"
          }
        ]
      },
      {
        "title": "QPS",
        "type": "timeseries",
        "targets": [
          {
            "expr": "rate(weaviate_query_count_total[1m])",
            "legendFormat": "QPS"
          }
        ]
      },
      {
        "title": "Object Count",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(weaviate_object_count)",
            "legendFormat": "Total Objects"
          }
        ]
      }
    ]
  }
}
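
The P99 panel relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified re-implementation for intuition only, not Prometheus's exact code, and the bucket values are invented.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with a +Inf upper bound")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # observations at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # No finite upper bound to interpolate towards
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented latency buckets in seconds: (upper bound, cumulative count)
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.5, latency_buckets))
print(histogram_quantile(0.99, latency_buckets))
```

Because the estimate can never be finer than the bucket boundaries, bucket layout limits how precise a P99 panel or alert can be.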

Log Management #

Configuring the Log Level #

yaml

environment:
  LOG_LEVEL: 'info'
  LOG_FORMAT: 'json'

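Log Levels #

text

Log levels:

debug:
├── Detailed debugging output
├── Intended for development
└── Not recommended in production

info:
├── General operational messages
├── Default level
└── Recommended for production

warn:
├── Warning messages
└── Logs warnings and errors only

error:
├── Error messages
└── Logs errors only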

Log Format #

json

{
  "level": "info",
  "msg": "request completed",
  "method": "GET",
  "path": "/v1/objects",
  "status": 200,
  "duration_ms": 15,
  "remote_addr": "192.168.1.1",
  "time": "2024-01-15T10:00:00Z"
}
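
With JSON logs, simple analysis needs nothing beyond the standard library. This sketch filters slow requests out of a log stream. It assumes one JSON object per line and the field names of the sample entry above (`duration_ms`, `path`), which may differ from what a given Weaviate version emits. `slow_requests` is a hypothetical helper.

```python
import json
from typing import Iterable, List


def slow_requests(lines: Iterable[str], threshold_ms: float = 100.0) -> List[dict]:
    """Return parsed log entries whose duration_ms exceeds the threshold."""
    hits = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON (e.g. startup banners)
        if not isinstance(entry, dict):
            continue
        duration = entry.get("duration_ms")
        if isinstance(duration, (int, float)) and duration > threshold_ms:
            hits.append(entry)
    return hits


# Invented log lines for demonstration
logs = [
    '{"level": "info", "msg": "request completed", "path": "/v1/objects", "status": 200, "duration_ms": 15}',
    '{"level": "info", "msg": "request completed", "path": "/v1/graphql", "status": 200, "duration_ms": 450}',
    "plain text line that is not JSON",
]
slow = slow_requests(logs, threshold_ms=100)
for entry in slow:
    print(entry["path"], entry["duration_ms"])
```

The same function can consume `docker logs` output piped in line by line through `sys.stdin`.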

Log Collection Configuration #

yaml

version: '3.8'
services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:1.25.0
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "5"

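Performance Monitoring #

Query Performance Monitoring #

python

import time
import weaviate

client = weaviate.connect_to_local()
articles = client.collections.get("Article")

# perf_counter() is monotonic and better suited to timing than time.time()
start_time = time.perf_counter()

response = articles.query.near_text(
    query="vector database",
    limit=10
)

elapsed_ms = (time.perf_counter() - start_time) * 1000

# The measurement includes network and client-side overhead, not only server time
print(f"Query time: {elapsed_ms:.2f}ms")
print(f"Results: {len(response.objects)}")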
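
A single timing is noisy, so latency is usually reported as percentiles over repeated runs. The sketch below times any zero-argument callable. The helper name `measure_latency` and its `repeats` and `warmup` defaults are illustrative assumptions.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(
    run_query: Callable[[], object],
    repeats: int = 20,
    warmup: int = 2,
) -> Dict[str, float]:
    """Time `run_query` repeatedly and return p50, p95, and max in milliseconds."""
    if repeats < 2:
        raise ValueError("repeats must be at least 2")
    for _ in range(warmup):
        run_query()  # warm caches and connections; not measured
    samples_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples_ms.append((time.perf_counter() - start) * 1000)
    # quantiles(n=100) yields 99 cut points; index 94 is the 95th percentile
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "p50": statistics.median(samples_ms),
        "p95": cuts[94],
        "max": max(samples_ms),
    }
```

With the `articles` collection from the example above, a call could look like `measure_latency(lambda: articles.query.near_text(query="vector database", limit=10))`.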

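Batch Import Monitoring #

python

import time

start_time = time.perf_counter()

# Assumes `articles` is the collection handle obtained earlier
with articles.batch.fixed_size(batch_size=100) as batch:
    for i in range(10000):
        batch.add_object(
            properties={
                "title": f"Article {i}",
                "content": f"Content {i}"
            }
        )

elapsed = time.perf_counter() - start_time

# Objects rejected during the batch are collected here rather than raised
failed = articles.batch.failed_objects

print(f"Import time: {elapsed:.2f}s")
print(f"Throughput: {10000 / elapsed:.0f} objects/s")
print(f"Failed objects: {len(failed)}")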

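Memory Monitoring #

python

import psutil
import os

# Note: this measures the current Python (client) process, not the Weaviate server.
# For server memory, use `docker stats` or the Go runtime metrics on the metrics endpoint.
process = psutil.Process(os.getpid())

memory_info = process.memory_info()
print(f"Memory usage: {memory_info.rss / 1024 / 1024:.2f} MB")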

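Troubleshooting #

Common Issues #

text

Common issues and what to check:

1. Connection failures
   ├── Check that the service is running
   ├── Check that the port is open
   └── Check the network configuration

2. Query timeouts
   ├── Check query complexity
   ├── Check the index configuration
   └── Check resource usage

3. Out of memory
   ├── Check the data volume
   ├── Enable vector quantization
   └── Increase the memory allocation

4. Write failures
   ├── Check disk space
   ├── Check permissions
   └── Check the consistency level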
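
The first two checks for a connection failure (service up, port open) can be scripted with a plain TCP connect. This is a minimal sketch: the port map uses the REST port 8080 and metrics port 2112 from this chapter, plus 50051, assumed here as the gRPC port. `port_is_open` and `check_endpoints` are hypothetical helper names.

```python
import socket
from typing import Dict, Optional


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out
        return False


def check_endpoints(
    host: str = "localhost",
    ports: Optional[Dict[str, int]] = None,
) -> Dict[str, bool]:
    """Check each named port and report whether it accepts connections."""
    if ports is None:
        # Assumed defaults; adjust to match the actual deployment
        ports = {"REST": 8080, "gRPC": 50051, "metrics": 2112}
    return {name: port_is_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, is_open in check_endpoints().items():
        print(f"{name}: {'open' if is_open else 'closed'}")
```

An open port only shows that something is listening. The readiness endpoint remains the authoritative check that Weaviate can serve requests.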

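Diagnostic Commands #

bash

# Last 100 lines of the container log
docker logs weaviate --tail 100

# Node status for the cluster (run from the host; the container image may not include curl)
curl http://localhost:8080/v1/nodes

# Live CPU and memory usage of the container
docker stats weaviate

# Server version and enabled modules
curl http://localhost:8080/v1/meta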

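Profiling #

python

import cProfile
import pstats

# cProfile only sees client-side Python code; server time appears as network wait.
# Assumes `client` is an already connected Weaviate client.
def profile_query():
    articles = client.collections.get("Article")
    response = articles.query.near_text(
        query="vector database",
        limit=10
    )

profiler = cProfile.Profile()
profiler.enable()

profile_query()

profiler.disable()

# Show the 10 most expensive calls by cumulative time
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10)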

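Operational Tasks #

Data Cleanup #

python

from datetime import datetime, timezone
from weaviate.classes.query import Filter

articles = client.collections.get("Article")

# Assumes "createdAt" is a date property defined on the collection
cutoff = datetime(2024, 1, 1, tzinfo=timezone.utc)

# A single call deletes at most QUERY_MAXIMUM_RESULTS objects (10,000 by default);
# repeat the call until no matches remain
result = articles.data.delete_many(
    where=Filter.by_property("createdAt").less_than(cutoff)
)

print(f"Deleted {result.successful} objects")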

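Rebuilding an Index #

python

articles = client.collections.get("Article")

# Capture the full collection definition before dropping it
config_dict = articles.config.get().to_dict()

# WARNING: deleting a collection also deletes all of its objects.
# Export the data first and re-import it afterwards.
client.collections.delete("Article")

# Recreate the collection from the saved definition
client.collections.create_from_dict(config_dict)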

Data Export #

python

import json

articles = client.collections.get("Article")

all_objects = []

# The iterator uses cursor-based pagination, so it is not capped by the
# offset + limit ceiling (QUERY_MAXIMUM_RESULTS) that applies to fetch_objects.
# Vectors are not exported here; pass include_vector=True to keep them.
for obj in articles.iterator():
    all_objects.append({
        "id": str(obj.uuid),
        "properties": obj.properties
    })

# default=str serializes values such as datetimes that json cannot encode natively
with open("export.json", "w", encoding="utf-8") as f:
    json.dump(all_objects, f, ensure_ascii=False, indent=2, default=str)

print(f"Exported {len(all_objects)} objects")
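
Collecting every object in a list before writing can exhaust memory on large collections. One alternative is to stream records as JSON Lines, one object per line. The sketch below assumes each object exposes `uuid` and `properties`, as in the export code above. `export_jsonl` is a hypothetical helper, and the stand-in data replaces a real collection iterator.

```python
import io
import json
from types import SimpleNamespace
from typing import IO, Iterable


def export_jsonl(objects: Iterable, out: IO[str]) -> int:
    """Write one JSON record per line and return the number written."""
    count = 0
    for obj in objects:
        record = {"id": str(obj.uuid), "properties": obj.properties}
        # default=str covers values such as datetimes
        out.write(json.dumps(record, ensure_ascii=False, default=str) + "\n")
        count += 1
    return count


# Stand-in objects; with a live client, pass the collection iterator instead.
fake_objects = [
    SimpleNamespace(
        uuid="00000000-0000-0000-0000-000000000001",
        properties={"title": "Article 1"},
    ),
    SimpleNamespace(
        uuid="00000000-0000-0000-0000-000000000002",
        properties={"title": "Article 2"},
    ),
]
buffer = io.StringIO()
written = export_jsonl(fake_objects, buffer)
print(f"Exported {written} objects")
```

For a real export, `out` would be a file opened with `open("export.jsonl", "w", encoding="utf-8")`, and the import side would read it back line by line.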

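Data Import #

python

import json

with open("export.json", "r", encoding="utf-8") as f:
    objects = json.load(f)

articles = client.collections.get("Article")

# Reuse the exported UUIDs so object identities are preserved
with articles.batch.dynamic() as batch:
    for obj in objects:
        batch.add_object(
            properties=obj["properties"],
            uuid=obj["id"]
        )

# Objects rejected during the batch are collected here rather than raised
failed = articles.batch.failed_objects
print(f"Imported {len(objects) - len(failed)} objects, {len(failed)} failed")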

Alerting #

Prometheus Alerting Rules #

yaml

groups:
  - name: weaviate
    rules:
      - alert: WeaviateHighLatency
        expr: histogram_quantile(0.99, rate(weaviate_query_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Weaviate high query latency"
          description: "P99 latency is above 1s"

      - alert: WeaviateHighErrorRate
        expr: rate(weaviate_query_errors_total[5m]) / rate(weaviate_query_count_total[5m]) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Weaviate high error rate"
          description: "Error rate is above 1%"

      - alert: WeaviateNodeDown
        expr: up{job="weaviate"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Weaviate node is down"
          description: "Node {{ $labels.instance }} is down"
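
The error-rate rule divides one counter's per-second rate by another's. The arithmetic can be sketched in a few lines. This is a simplification of PromQL's `rate()`: it uses only the first and last sample in the window and a basic counter-reset rule, and all sample values are invented.

```python
from typing import Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def counter_rate(first: Sample, last: Sample) -> float:
    """Per-second increase of a counter between two samples."""
    (t0, v0), (t1, v1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    if v1 < v0:
        # Counter reset (e.g. process restart): count from zero
        return v1 / (t1 - t0)
    return (v1 - v0) / (t1 - t0)


def error_ratio_exceeded(
    errors: Tuple[Sample, Sample],
    queries: Tuple[Sample, Sample],
    threshold: float = 0.01,
) -> bool:
    """Mirror the rule `rate(errors) / rate(queries) > threshold`."""
    query_rate = counter_rate(*queries)
    if query_rate == 0:
        return False  # no traffic, so no meaningful ratio
    return counter_rate(*errors) / query_rate > threshold


# Invented 5-minute window: 30 errors out of 1500 queries
errors_window = ((0.0, 10.0), (300.0, 40.0))
queries_window = ((0.0, 1000.0), (300.0, 2500.0))
print(error_ratio_exceeded(errors_window, queries_window))
```

The `for: 5m` clause in the rule adds a further condition that this sketch does not model: the expression must stay above the threshold for the whole period before the alert fires.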

Alertmanager Configuration #

yaml

route:
  receiver: 'team-notifications'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h

receivers:
  - name: 'team-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
        send_resolved: true

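Summary #

This chapter covered monitoring and operating Weaviate:

Health checks
Prometheus metrics
Grafana dashboards
Log management
Performance monitoring
Troubleshooting
Operational tasks
Alerting

Next Steps #

Continue with semantic search to start building real applications.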