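Monitoring and Operations #
This chapter covers monitoring and operating Weaviate: health checks, Prometheus metrics, Grafana dashboards, logging, performance measurement, troubleshooting, routine operational tasks, and alerting.
Monitoring Architecture #
text
Weaviate monitoring architecture: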
┌─────────────────────────────────────────────────────────────┐
│                      Monitoring system                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                       Grafana                       │    │
│  │             (visualization dashboards)              │    │
│  └─────────────────────────────────────────────────────┘    │
│                             │                               │
│                             ▼                               │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                     Prometheus                      │    │
│  │          (metrics collection and storage)           │    │
│  └─────────────────────────────────────────────────────┘    │
│                             │                               │
│                             ▼                               │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                      Weaviate                       │    │
│  │               :2112/metrics endpoint                │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                       Logging                       │    │
│  │                 ELK / Loki / stdout                 │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
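Health Checks #
HTTP Health Check Endpoints #
bash
# Readiness: returns HTTP 200 once the instance can serve requests
curl -i http://localhost:8080/v1/.well-known/ready
# Liveness: returns HTTP 200 while the process is running
curl -i http://localhost:8080/v1/.well-known/live
# OIDC discovery (not a health probe); only meaningful when OIDC authentication is configured
curl http://localhost:8080/v1/.well-known/openid-configuration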
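Deployment scripts often need to wait for the readiness endpoint shown above before sending traffic. The following is a minimal sketch using only the Python standard library. The function names (http_probe, wait_until_ready) and the retry settings are illustrative assumptions, not part of Weaviate or its client.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status such as 503
        return False


def wait_until_ready(
    base_url: str,
    probe: Callable[[str], bool] = http_probe,
    attempts: int = 10,
    delay_s: float = 1.0,
) -> bool:
    """Poll the readiness endpoint until it succeeds or attempts run out."""
    url = f"{base_url}/v1/.well-known/ready"
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)  # hypothetical fixed delay between attempts
    return False
```

Calling wait_until_ready("http://localhost:8080") polls a local instance. The probe is passed in as a parameter so the retry logic can be exercised without a running server.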
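Python Health Check #
python
import weaviate

# Connects to localhost:8080 (HTTP) and localhost:50051 (gRPC) by default
client = weaviate.connect_to_local()
print(f"Ready: {client.is_ready()}")
print(f"Live: {client.is_live()}")

# get_meta() returns a dict that includes the server version and hostname
meta = client.get_meta()
print(f"Version: {meta['version']}")
print(f"Hostname: {meta['hostname']}")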
Cluster Health Check #
python
# Lists every node in the cluster; reuses the client created above
nodes = client.cluster.nodes()
for node in nodes:
    print(f"Node: {node.name}")
    print(f"  Status: {node.status}")
    print(f"  Version: {node.version}")
    print(f"  Git hash: {node.git_hash}")
Prometheus Metrics #
Enabling Prometheus Metrics #
yaml
environment:
  PROMETHEUS_MONITORING_ENABLED: 'true'
  PROMETHEUS_MONITORING_PORT: '2112'  # 2112 is also the default
Metrics Endpoint #
bash
curl http://localhost:2112/metrics
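The endpoint returns the Prometheus text exposition format: comment lines starting with #, then one sample per line. To check which metric names an instance actually exports, the output can be parsed with a few lines of Python. This is a simplified sketch, not a full parser (it keeps the label set as a raw string); the SAMPLE text and the demo_requests_total metric are made up for illustration.

```python
from collections import defaultdict

# Hypothetical scrape output used only to demonstrate the parser
SAMPLE = """\
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 42
demo_requests_total{method="GET",code="200"} 1027
demo_requests_total{method="POST",code="500"} 3
"""


def parse_metrics(text: str) -> dict[str, list[tuple[str, float]]]:
    """Group samples by metric name as (label_string, value) pairs."""
    samples = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name, rest = line.split("{", 1)
            labels, _, tail = rest.rpartition("}")
        else:
            name, _, tail = line.partition(" ")
            labels = ""
        value = float(tail.split()[0])  # ignore an optional timestamp
        samples[name].append((labels, value))
    return dict(samples)


metrics = parse_metrics(SAMPLE)
print(sorted(metrics))
```

Feeding the function the body returned by the curl command above gives the list of metric names to use when building dashboards and alert rules.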
Key Metrics #
text
Key Weaviate metrics (the names below are illustrative and vary by version; confirm them against the output of the metrics endpoint before using them in dashboards or alerts):
Query performance:
├── weaviate_query_duration_seconds
├── weaviate_query_count_total
└── weaviate_query_errors_total
Import performance:
├── weaviate_import_duration_seconds
├── weaviate_import_count_total
└── weaviate_import_errors_total
Storage:
├── weaviate_object_count
├── weaviate_shard_count
└── weaviate_index_size_bytes
Resources:
├── weaviate_memory_usage_bytes
├── weaviate_goroutine_count
└── weaviate_gc_duration_seconds
Prometheus Configuration #
yaml
scrape_configs:
  - job_name: 'weaviate'
    static_configs:
      - targets: ['weaviate:2112']
    metrics_path: '/metrics'
    scrape_interval: 15s
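Grafana Dashboards #
Importing a Dashboard #
Weaviate's GitHub repository includes sample Grafana dashboards as JSON files. To import one:
- Open Grafana
- Go to Dashboards and choose Import
- Upload or paste the dashboard JSON
The identifier weaviate-official-dashboard is a placeholder, not a real Grafana.com dashboard ID (those IDs are numeric).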
Custom Dashboard #
json
{
  "dashboard": {
    "title": "Weaviate Monitoring",
    "panels": [
      {
        "title": "Query Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, rate(weaviate_query_duration_seconds_bucket[5m]))",
            "legendFormat": "P99"
          }
        ]
      },
      {
        "title": "QPS",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(weaviate_query_count_total[1m])",
            "legendFormat": "QPS"
          }
        ]
      },
      {
        "title": "Object Count",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(weaviate_object_count)",
            "legendFormat": "Total Objects"
          }
        ]
      }
    ]
  }
}
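The P99 panel relies on histogram_quantile, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below reproduces that idea in plain Python to show why the result depends on bucket boundaries. It is a simplified illustration rather than Prometheus's exact implementation, and the bucket values are invented.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    The last bucket is expected to have upper bound +Inf, as in Prometheus.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into +Inf: report the last finite bound
                return prev_bound
            width = count - prev_count
            # Assume observations are spread evenly inside the bucket
            fraction = (rank - prev_count) / width if width else 0.0
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical latency buckets in seconds: (le, cumulative count)
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, BUCKETS))
print(histogram_quantile(0.99, BUCKETS))
```

With these buckets the 95th percentile falls inside the 0.5 to 1.0 bucket and is interpolated, while the 99th percentile falls in the +Inf bucket and is capped at 1.0. A capped value like this usually means the histogram needs larger finite buckets.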
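Log Management #
Log Level Configuration #
yaml
environment:
  LOG_LEVEL: 'info'
  LOG_FORMAT: 'json'
Log Levels #
text
Log levels:
debug:
├── Detailed debugging output
├── Intended for development
└── Not recommended in production
info:
├── General operational messages
├── Default level
└── Recommended for production
warn:
├── Warnings
└── Logs warnings and errors only
error:
├── Errors
└── Logs errors only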
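Log Format #
With LOG_FORMAT set to json, each log entry is a single JSON object. The example below is illustrative; the exact fields depend on the message.
json
{
  "level": "info",
  "msg": "request completed",
  "method": "GET",
  "path": "/v1/objects",
  "status": 200,
  "duration_ms": 15,
  "remote_addr": "192.168.1.1",
  "time": "2024-01-15T10:00:00Z"
}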
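Because every entry is one JSON object per line, logs can be summarized with a short script before they reach a full log pipeline. The sketch below counts entries per level and lists slow requests. It assumes the field names from the example entry (level, path, duration_ms), which may not be present in every real log line; the sample lines and the 100 ms threshold are made up.

```python
import json
from collections import Counter

# Hypothetical log lines modeled on the example entry
SAMPLE_LINES = [
    '{"level": "info", "msg": "request completed", "path": "/v1/objects", "duration_ms": 15}',
    '{"level": "error", "msg": "request failed", "path": "/v1/graphql", "duration_ms": 250}',
    "plain text line that is not JSON",
]


def summarize_logs(lines, slow_ms=100):
    """Count entries per level and collect requests slower than `slow_ms`."""
    levels = Counter()
    slow = []
    for raw in lines:
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict):
            continue
        levels[entry.get("level", "unknown")] += 1
        if entry.get("duration_ms", 0) > slow_ms:
            slow.append((entry.get("path"), entry["duration_ms"]))
    return levels, slow


levels, slow = summarize_logs(SAMPLE_LINES)
print(dict(levels), slow)
```

The same function works on real output by passing it the lines of a saved log file.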
Log Collection Configuration #
yaml
version: '3.8'
services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:1.25.0
    logging:
      driver: "json-file"
      options:
        # Keep at most 5 rotated files of 100 MB each
        max-size: "100m"
        max-file: "5"
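Performance Monitoring #
Query Performance Monitoring #
python
import time
import weaviate

client = weaviate.connect_to_local()
articles = client.collections.get("Article")

# perf_counter() is a monotonic clock suited to timing short intervals
start_time = time.perf_counter()
response = articles.query.near_text(
    query="vector database",
    limit=10
)
elapsed = time.perf_counter() - start_time

# Client-observed latency: includes network and serialization time
print(f"Query time: {elapsed * 1000:.2f}ms")
print(f"Results: {len(response.objects)}")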
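A single timing is noisy, so latency is usually reported as percentiles over many runs. The helper below is a small sketch of that approach using only the standard library. The name measure_latency and the default of 20 repeats are illustrative choices, and the query is passed in as a callable so the helper does not depend on a running Weaviate instance.

```python
import statistics
import time
from typing import Callable


def measure_latency(run_query: Callable[[], object], repeats: int = 20) -> dict[str, float]:
    """Time `run_query` repeatedly and return latency stats in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000)
    # quantiles(n=100) returns 99 cut points; index 94 is P95, index 98 is P99
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(samples),
        "p50": statistics.median(samples),
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(samples),
    }


# Stand-in workload; replace the lambda with a real query call
stats = measure_latency(lambda: sum(range(10_000)), repeats=10)
print({name: round(value, 3) for name, value in stats.items()})
```

To time the query from this section, pass lambda: articles.query.near_text(query="vector database", limit=10) as run_query. With few repeats the P99 value is close to the maximum, so treat it as indicative only.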
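Batch Import Monitoring #
python
import time

total = 10000
start_time = time.perf_counter()
with articles.batch.fixed_size(batch_size=100) as batch:
    for i in range(total):
        batch.add_object(
            properties={
                "title": f"Article {i}",
                "content": f"Content {i}"
            }
        )
elapsed = time.perf_counter() - start_time

# Objects rejected by the server are collected here rather than raising
failed = articles.batch.failed_objects
print(f"Import time: {elapsed:.2f}s")
print(f"Throughput: {total / elapsed:.0f} objects/s")
print(f"Failed objects: {len(failed)}")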
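Memory Monitoring #
python
import os
import psutil

# Note: this measures the current Python (client) process, not the Weaviate
# server. For server memory, use docker stats or the metrics endpoint.
process = psutil.Process(os.getpid())
memory_info = process.memory_info()
print(f"Memory usage: {memory_info.rss / 1024 / 1024:.2f} MB")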
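Troubleshooting #
Common Issues #
text
Troubleshooting common issues:
1. Connection failures
├── Check that the service is running
├── Check that the ports are open
└── Check the network configuration
2. Query timeouts
├── Check query complexity
├── Check the index configuration
└── Check resource usage
3. Out of memory
├── Check the data volume
├── Enable vector quantization
└── Increase the memory allocation
4. Write failures
├── Check free disk space
├── Check file permissions
└── Check the consistency level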
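Two of the checks in this list, open ports and free disk space, are easy to script. The following sketch uses only the standard library. The helper names are hypothetical, and the ports in the usage note (8080 for HTTP, 50051 for gRPC) are the defaults of a local setup; adjust them to the actual deployment.

```python
import shutil
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, unreachable host, or timeout
        return False


def disk_free_fraction(path: str) -> float:
    """Return the fraction of free space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


# Disk check on the current directory; point it at the data volume in practice
print(f"Free disk: {disk_free_fraction('.'):.0%}")
```

For a connection failure, calling port_open("localhost", 8080) and port_open("localhost", 50051) shows whether the HTTP and gRPC ports are reachable. This separates a network problem from an application error.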
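Diagnostic Commands #
bash
# Show the last 100 log lines
docker logs weaviate --tail 100
# Node status; run from the host, since curl may not be installed in the container image
curl http://localhost:8080/v1/nodes
# Live CPU and memory usage of the container
docker stats weaviate
# Server version and enabled modules
curl http://localhost:8080/v1/meta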
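Performance Profiling #
python
import cProfile
import pstats
import weaviate

client = weaviate.connect_to_local()

def profile_query():
    articles = client.collections.get("Article")
    articles.query.near_text(
        query="vector database",
        limit=10
    )

# cProfile only sees client-side Python code; server time appears as network wait
profiler = cProfile.Profile()
profiler.enable()
profile_query()
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10)  # top 10 entries by cumulative time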
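Operations #
Data Cleanup #
python
from datetime import datetime, timezone
from weaviate.classes.query import Filter

articles = client.collections.get("Article")
# Assumes "createdAt" is a date property defined on the collection;
# date filters take a timezone-aware datetime rather than a plain string
cutoff = datetime(2024, 1, 1, tzinfo=timezone.utc)
result = articles.data.delete_many(
    where=Filter.by_property("createdAt").less_than(cutoff)
)
print(f"Deleted {result.successful} objects")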
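Index Rebuild #
python
# WARNING: deleting a collection also deletes every object in it.
# Export the data first (see Data Export) and re-import it afterwards.
articles = client.collections.get("Article")
config = articles.config.get()
client.collections.delete("Article")
# Recreate the collection from the saved configuration
client.collections.create_from_config(config)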
Data Export #
python
import json

articles = client.collections.get("Article")
all_objects = []
# iterator() pages through the collection with a cursor, so it is not bound by
# the maximum result window that limits offset-based pagination.
# Vectors are not included in this export.
for obj in articles.iterator():
    all_objects.append({
        "id": str(obj.uuid),
        "properties": obj.properties
    })

with open("export.json", "w", encoding="utf-8") as f:
    # default=str serializes values such as datetimes that json cannot encode
    json.dump(all_objects, f, ensure_ascii=False, indent=2, default=str)
print(f"Exported {len(all_objects)} objects")
Data Import #
python
import json

with open("export.json", "r", encoding="utf-8") as f:
    objects = json.load(f)

articles = client.collections.get("Article")
with articles.batch.dynamic() as batch:
    for obj in objects:
        batch.add_object(
            properties=obj["properties"],
            uuid=obj["id"]  # reuse the original UUID so IDs stay stable
        )

failed = articles.batch.failed_objects
print(f"Imported {len(objects) - len(failed)} objects, {len(failed)} failed")
Alerting Configuration #
Prometheus Alert Rules #
yaml
groups:
  - name: weaviate
    rules:
      - alert: WeaviateHighLatency
        expr: histogram_quantile(0.99, rate(weaviate_query_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Weaviate high query latency"
          description: "P99 latency is above 1s"
      - alert: WeaviateHighErrorRate
        expr: rate(weaviate_query_errors_total[5m]) / rate(weaviate_query_count_total[5m]) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Weaviate high error rate"
          description: "Error rate is above 1%"
      - alert: WeaviateNodeDown
        expr: up{job="weaviate"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Weaviate node is down"
          description: "Node {{ $labels.instance }} is down"
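The WeaviateHighErrorRate rule divides the per-second rate of the error counter by the per-second rate of the request counter over a 5-minute window. The arithmetic can be checked by hand with a simplified model of rate() that takes only the first and last sample in the window. Real Prometheus also extrapolates to the window edges, and the counter values below are invented.

```python
def counter_rate(first: float, last: float, window_s: float) -> float:
    """Per-second increase of a counter over a window, tolerating one reset."""
    # If the counter went backwards, assume it restarted from zero
    increase = last - first if last >= first else last
    return increase / window_s


def error_ratio(
    errors: tuple[float, float],
    total: tuple[float, float],
    window_s: float = 300.0,
) -> float:
    """Ratio of error rate to request rate; 0.0 when there was no traffic."""
    total_rate = counter_rate(total[0], total[1], window_s)
    if total_rate == 0:
        # PromQL would yield NaN here, which never satisfies "> 0.01"
        return 0.0
    return counter_rate(errors[0], errors[1], window_s) / total_rate


# Hypothetical counter samples at the start and end of a 5-minute window
ratio = error_ratio(errors=(10, 25), total=(1000, 2000))
print(f"error ratio = {ratio:.3f}, alert condition met: {ratio > 0.01}")
```

Here 15 errors out of 1000 requests gives a ratio of 0.015, above the 1% threshold. The rule would still fire only if the condition held for the full 5 minutes set by the for clause.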
Alertmanager Configuration #
yaml
route:
  receiver: 'team-notifications'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
receivers:
  - name: 'team-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
        send_resolved: true
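Summary #
This chapter covered monitoring and operating Weaviate:
- Health checks
- Prometheus metrics
- Grafana dashboards
- Log management
- Performance monitoring
- Troubleshooting
- Operational tasks
- Alerting configuration
Next Steps #
Continue with Semantic Search to start applying Weaviate in practice.
Last updated: 2026-04-04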