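Health Checks
1. Health Check Overview
1.1 What Is a Health Check
A health check is Varnish's mechanism for actively probing the state of its backend servers. It is used to:
- detect whether a backend is available
- take failed servers out of rotation automatically
- bring recovered servers back automatically
- enable failover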
1.2 How It Works
text
┌─────────────────────────────────────────────────────────┐
│                How a health check works                 │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Varnish ──► Probe Request ──► Backend Server           │
│     │                               │                   │
│     │ ◄────────── Response ─────────┘                   │
│     │                                                   │
│     ├──► Success (200 OK) ──► marked Healthy            │
│     │                                                   │
│     └──► Failure (timeout/error) ──► marked Sick        │
│                                                         │
└─────────────────────────────────────────────────────────┘
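The decision in the diagram comes down to one test per probe. The following is a minimal Python sketch of that test, not Varnish code: the function name `classify_probe` and its parameters are illustrative, and it assumes a probe is good only when a response arrived within the timeout and carried the expected status.

```python
from typing import Optional


def classify_probe(status: Optional[int], elapsed: float,
                   timeout: float = 2.0, expected: int = 200) -> bool:
    """Return True when a single probe counts as good.

    `status` is None when no response arrived (connection error or timeout).
    """
    if status is None:
        return False          # nothing came back
    if elapsed > timeout:
        return False          # answer arrived too late
    return status == expected  # only the expected status counts


print(classify_probe(200, 0.05))   # fast 200
print(classify_probe(503, 0.05))   # wrong status
print(classify_probe(None, 2.0))   # no response
```

A single probe result does not change the backend's state by itself; the results are accumulated over a window, as described in section 5.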
2. Basic Configuration
2.1 A Simple Health Check
vcl
backend default {
    .host = "192.168.1.10";
    .port = "80";
    .probe = {
        .url = "/healthcheck";
    }
}
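2.2 Full Configuration
vcl
backend default {
    .host = "192.168.1.10";
    .port = "80";
    .probe = {
        .url = "/healthcheck";      # path to request
        .timeout = 2s;              # give up on a probe after 2 seconds
        .interval = 5s;             # probe every 5 seconds
        .window = 5;                # judge health over the last 5 probes
        .threshold = 3;             # at least 3 of them must be good
        .initial = 2;               # probes counted as good at startup
        .expected_response = 200;   # status code that counts as good
    }
}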
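2.3 Parameter Reference
| Parameter | Default | Description |
|---|---|---|
| url | / | URL path to probe |
| timeout | 2s | Time allowed for a probe to complete |
| interval | 5s | Time between probes |
| window | 8 | Number of most recent probes considered (sliding window size) |
| threshold | 3 | Minimum number of good probes in the window for the backend to be healthy |
| initial | threshold - 1 | Number of probes in the window counted as good when Varnish starts (a count, not a delay) |
| expected_response | 200 | Expected HTTP status code |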
3. Advanced Configuration
3.1 Custom Request
vcl
backend default {
    .host = "192.168.1.10";
    .port = "80";
    .probe = {
        # Varnish inserts \r\n after every string of .request
        .request =
            "GET /healthcheck HTTP/1.1"
            "Host: example.com"
            "User-Agent: Varnish Health Check"
            "Connection: close"
            "";
        .timeout = 2s;
        .interval = 5s;
    }
}
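To see what such a `.request` looks like on the wire, the sketch below joins the strings the way the VCL documentation describes, with `\r\n` appended after each one. The helper name `build_probe_request` is made up for illustration, and the sketch models only that documented joining rule, not anything else Varnish may do to the request.

```python
def build_probe_request(lines):
    """Join .request strings with CRLF after each one and return raw bytes."""
    return "".join(line + "\r\n" for line in lines).encode("ascii")


raw = build_probe_request([
    "GET /healthcheck HTTP/1.1",
    "Host: example.com",
    "User-Agent: Varnish Health Check",
    "Connection: close",
    "",                      # empty string: becomes the blank line ending the headers
])
print(raw.decode("ascii"))
```

Under this rule the empty final string is what produces the blank line that terminates the header section.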
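3.2 POST Probe
vcl
backend api {
    .host = "192.168.1.20";
    .port = "8080";
    .probe = {
        # The body contains double quotes, so it is written as a long string ({"..."}).
        # The space before the closing brace keeps the body from ending in "} which
        # would terminate the long string early.
        # Content-Length must match the body: {"check":"true" } is 17 bytes.
        .request =
            "POST /api/health HTTP/1.1"
            "Host: api.example.com"
            "Content-Type: application/json"
            "Content-Length: 17"
            "Connection: close"
            ""
            {"{"check":"true" }"};
        .timeout = 5s;
        .interval = 10s;
    }
}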
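A POST probe is only valid if `Content-Length` equals the byte length of the body, so it is worth computing the value rather than counting by hand. A small Python sketch, using the compact JSON form of the body as an example:

```python
import json

# Compact JSON (no spaces), as it would be sent in the probe body.
body = json.dumps({"check": "true"}, separators=(",", ":"))

# Content-Length counts bytes, not characters, so encode first.
content_length = len(body.encode("utf-8"))

print(body)
print(f"Content-Length: {content_length}")
```

For `{"check":"true"}` this gives 16. Any change to the body, even a single added space, changes the value that must go into the header.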
3.3 Probe with Authentication
vcl
backend default {
    .host = "192.168.1.10";
    .port = "80";
    .probe = {
        # The Basic token is the Base64 encoding of "user:pass"
        .request =
            "GET /healthcheck HTTP/1.1"
            "Host: example.com"
            "Authorization: Basic dXNlcjpwYXNz"
            "Connection: close"
            "";
        .timeout = 2s;
        .interval = 5s;
    }
}
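The token in the `Authorization` header is the Base64 encoding of `username:password`. A short Python sketch for generating it; the helper name `basic_auth_header` is illustrative, and the credentials `user` / `pass` are simply what the token above decodes to:

```python
import base64


def basic_auth_header(user: str, password: str) -> str:
    """Build an HTTP Basic Authorization header line."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Authorization: Basic {token}"


print(basic_auth_header("user", "pass"))
```

Base64 is an encoding, not encryption, so a probe configured this way exposes the credentials to anyone who can read the VCL file or the network traffic.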
4. Shared Probe Configuration
4.1 Defining a Probe Template
vcl
probe healthcheck {
    .url = "/healthcheck";
    .timeout = 2s;
    .interval = 5s;
    .window = 5;
    .threshold = 3;
}
backend server1 {
    .host = "192.168.1.10";
    .port = "80";
    .probe = healthcheck;
}
backend server2 {
    .host = "192.168.1.11";
    .port = "80";
    .probe = healthcheck;
}
backend server3 {
    .host = "192.168.1.12";
    .port = "80";
    .probe = healthcheck;
}
4.2 Different Probes for Different Backends
vcl
probe web_probe {
    .url = "/healthcheck";
    .timeout = 2s;
    .interval = 5s;
}
probe api_probe {
    .url = "/api/health";
    .timeout = 5s;
    .interval = 10s;
}
backend web1 {
    .host = "192.168.1.10";
    .port = "80";
    .probe = web_probe;
}
backend api1 {
    .host = "192.168.2.10";
    .port = "8080";
    .probe = api_probe;
}
5. Determining Health State
5.1 Sliding Window
text
Probe results: [good, good, bad, good, good]
window: 5
threshold: 3
Good probes among the last 5 >= 3 → Healthy
Good probes among the last 5 < 3 → Sick
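The window rule can be simulated in a few lines. This is a sketch of the counting logic only, not Varnish's implementation; the class name `ProbeWindow` is illustrative, and it assumes that at startup `initial` slots of the window are pre-filled as good (defaulting to `threshold - 1`).

```python
from collections import deque


class ProbeWindow:
    """Track the last `window` probe results and derive healthy/sick."""

    def __init__(self, window=5, threshold=3, initial=None):
        if initial is None:
            initial = threshold - 1          # assumed default: one good probe short
        self.threshold = threshold
        # Oldest entries first: bad slots, then `initial` good slots.
        start = [False] * (window - initial) + [True] * initial
        self.results = deque(start, maxlen=window)

    @property
    def healthy(self):
        return sum(self.results) >= self.threshold

    def record(self, ok):
        """Add one probe result (the oldest drops out) and return the new state."""
        self.results.append(ok)
        return self.healthy


w = ProbeWindow(window=5, threshold=3)
for ok in [True, True, False, True, True]:
    w.record(ok)
print(w.healthy)         # 4 good of 5: healthy
print(w.record(False))   # 3 good of 5: still healthy
print(w.record(False))   # 2 good of 5: sick
```

With the assumed startup state a new backend begins as sick and becomes healthy after its first good probe.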
5.2 State Transitions
text
┌─────────────────────────────────────────────────────────┐
│                    State transitions                    │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Healthy ──► good probes < threshold ──► Sick           │
│                                                         │
│  Sick ──► good probes >= threshold ──► Healthy          │
│                                                         │
│  New backend ──► starts with `initial` good probes      │
│                                                         │
└─────────────────────────────────────────────────────────┘
5.3 Viewing Health State
bash
# List backends
varnishadm backend.list
# Example output (illustrative; the exact columns vary by Varnish version)
name       ref   probe   health
server1    1     5/5     healthy
server2    1     3/5     healthy
server3    1     1/5     sick
# Show probe details
varnishadm backend.list -p
# Example output (illustrative)
server1:
    health: healthy
    probe: 5/5
    last check: 2 seconds ago
    last change: 5 minutes ago
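6. Managing State Manually
6.1 Setting the Health State
bash
# Force the backend to healthy
varnishadm backend.set_health server1 healthy
# Force the backend to sick
varnishadm backend.set_health server1 sick
# Return to automatic mode (follow the probe results)
varnishadm backend.set_health server1 auto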
6.2 Checking State in VCL
vcl
import std;
sub vcl_recv {
    # Check the health of the selected backend
    if (!std.healthy(req.backend_hint)) {
        # Backend is sick: return an error
        return (synth(503, "Backend Unavailable"));
    }
}
7. Failover
7.1 Using the fallback Director
vcl
import directors;
# `healthcheck` is the shared probe defined in section 4.1
backend primary {
    .host = "192.168.1.10";
    .port = "80";
    .probe = healthcheck;
}
backend secondary {
    .host = "192.168.1.11";
    .port = "80";
    .probe = healthcheck;
}
sub vcl_init {
    # The fallback director tries backends in the order they were added
    new cluster = directors.fallback();
    cluster.add_backend(primary);
    cluster.add_backend(secondary);
}
sub vcl_recv {
    set req.backend_hint = cluster.backend();
}
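The selection rule of a fallback director is simple enough to state as code: walk the backends in the order they were added and take the first healthy one. The Python sketch below illustrates that rule only; `pick_fallback` and the `healthy` mapping are hypothetical names, not part of any Varnish API.

```python
def pick_fallback(backends, healthy):
    """Return the first healthy backend in insertion order, or None if all are sick."""
    for name in backends:
        if healthy.get(name, False):
            return name
    return None


order = ["primary", "secondary"]
print(pick_fallback(order, {"primary": True, "secondary": True}))
print(pick_fallback(order, {"primary": False, "secondary": True}))
print(pick_fallback(order, {"primary": False, "secondary": False}))
```

As soon as the primary is marked healthy again, the rule selects it again, so traffic returns without manual intervention.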
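7.2 Custom Failover
vcl
import std;   # required for std.healthy()
# `healthcheck` is the shared probe defined in section 4.1
backend primary {
    .host = "192.168.1.10";
    .port = "80";
    .probe = healthcheck;
}
backend secondary {
    .host = "192.168.1.11";
    .port = "80";
    .probe = healthcheck;
}
sub vcl_recv {
    # Prefer the primary; fall back to the secondary while the primary is sick
    if (std.healthy(primary)) {
        set req.backend_hint = primary;
    } else {
        set req.backend_hint = secondary;
    }
}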
7.3 Retries
vcl
sub vcl_backend_response {
    # Retry when the backend answers with a 5xx error
    if (beresp.status >= 500 && bereq.retries < 3) {
        return (retry);
    }
}
sub vcl_backend_error {
    # The fetch itself failed: switch to the secondary backend and retry
    if (bereq.retries < 3) {
        set bereq.backend = secondary;
        return (retry);
    }
}
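The control flow of these two subroutines can be traced with a small simulation. This Python sketch is an approximation for illustration: `fetch_with_retry` and `flaky` are invented names, a raised `ConnectionError` stands in for a failed fetch (the `vcl_backend_error` path), and a returned status of 500 or above stands in for the `vcl_backend_response` path.

```python
def fetch_with_retry(fetch, primary, secondary, max_retries=3):
    """Return (final status, list of backends tried)."""
    backend = primary
    attempts = []
    status = 503                          # reported if no attempt ever connects
    for _ in range(max_retries + 1):      # first try plus up to max_retries retries
        attempts.append(backend)
        try:
            status = fetch(backend)
        except ConnectionError:
            status = 503
            backend = secondary           # fetch failed: switch backend, then retry
            continue
        if status < 500:
            break                         # usable response: stop retrying
        # 5xx response: retry on the current backend
    return status, attempts


def flaky(backend):
    """Hypothetical backend behaviour: the primary is unreachable."""
    if backend == "primary":
        raise ConnectionError("primary is down")
    return 200


print(fetch_with_retry(flaky, "primary", "secondary"))
```

Note the asymmetry the simulation makes visible: a 5xx response is retried against the same backend, and only a failed fetch moves the request to the secondary.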
8. Monitoring Health Checks
8.1 Statistics
bash
# Show backend-related counters (quote the pattern so the shell does not expand it)
varnishstat -1 -f 'MAIN.backend_*'
# Example output (illustrative; real output contains additional columns)
MAIN.backend_conn        1234   Backend connections
MAIN.backend_unhealthy   5      Backend unhealthy events
MAIN.backend_busy        10     Backend busy events
MAIN.backend_fail        3      Backend failures
MAIN.backend_reuse       5678   Backend connection reuse
MAIN.backend_recycle     5678   Backend connection recycle
MAIN.backend_toolate     0      Backend connection too late
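For alerting it is usually more convenient to have these counters as numbers than as text. The sketch below parses output shaped like the sample above; `parse_varnishstat` is an illustrative helper, and it assumes only that each line starts with the counter name followed by an integer value.

```python
def parse_varnishstat(text):
    """Map counter name -> integer value, ignoring lines that do not fit."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


sample = """\
MAIN.backend_conn 1234 Backend connections
MAIN.backend_unhealthy 5 Backend unhealthy events
MAIN.backend_fail 3 Backend failures
"""

stats = parse_varnishstat(sample)
print(stats["MAIN.backend_fail"])
print(stats["MAIN.backend_unhealthy"])
```

In practice the text would come from running the `varnishstat` command shown above, for example through `subprocess.run`. Because these are cumulative counters, an alert should look at the change between two readings rather than at the absolute value.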
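8.2 Probe Log
bash
# Follow probe results. The tag is Backend_health, and these records are not
# part of any request transaction, so raw grouping is needed to see them.
varnishlog -g raw -i Backend_health
# Example output (illustrative; the exact fields and flags vary by Varnish version)
0 Backend_health - boot.server1 Still healthy 4---X-RH 5 3 5 0.000512 0.000498 HTTP/1.1 200 OK
0 Backend_health - boot.server2 Went sick 4---X-R- 2 3 5 0.000631 0.000502 HTTP/1.1 503 Service Unavailable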
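8.3 Monitoring Script
bash
#!/bin/bash
# health_monitor.sh
echo "=== Backend Health Status ==="
echo ""
varnishadm backend.list -p
echo ""
echo "=== Health Check Statistics ==="
# Quote the pattern so the shell does not expand it
varnishstat -1 -f 'MAIN.backend_*'
echo ""
echo "=== Recent Health Changes ==="
# -d reads the records already in the log buffer; Backend_health needs raw grouping
varnishlog -d -g raw -i Backend_health | tail -20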
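9. Health Check Best Practices
9.1 Designing the Probe Endpoint
A health check endpoint should verify the dependencies the service actually needs and return a non-200 status as soon as one of them is unavailable:
python
# A good health check endpoint (assuming Flask, which the decorator style suggests).
# `database`, `cache` and `critical_service` stand for the application's own clients.
from flask import Flask

app = Flask(__name__)

@app.route('/healthcheck')
def healthcheck():
    # Check the database connection
    if not database.is_connected():
        return 'Database unavailable', 503
    # Check the cache connection
    if not cache.is_connected():
        return 'Cache unavailable', 503
    # Check critical services
    if not critical_service.is_available():
        return 'Service unavailable', 503
    return 'OK', 200
Avoid:
python
# Bad example: always returns 200, so a broken backend still looks healthy
@app.route('/healthcheck')
def healthcheck():
    return 'OK', 200
# Bad example: does expensive work on every probe
@app.route('/healthcheck')
def healthcheck():
    # Run complex checks
    run_full_diagnostics()
    return 'OK', 200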
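The logic of the recommended endpoint can also be written independently of any web framework, which makes it easy to unit-test. The sketch below is one possible shape, under the assumption that each dependency is represented by a zero-argument callable returning `True` when it is usable; `run_checks` and the check names are hypothetical.

```python
def run_checks(checks):
    """Run checks in order; return (status, body) for the first failure, else 200."""
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False               # a check that raises counts as a failure
        if not ok:
            return 503, f"{name} unavailable"
    return 200, "OK"


def broken_cache():
    raise TimeoutError("cache did not answer")


print(run_checks({"database": lambda: True, "cache": lambda: True}))
print(run_checks({"database": lambda: True, "cache": broken_cache}))
```

Treating an exception as a failed check matters here: an unhandled error inside the endpoint would otherwise surface as a slow or malformed response instead of a clean 503.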
9.2 Parameter Tuning
| Scenario | interval | timeout | window | threshold |
|---|---|---|---|---|
| High availability | 2s | 1s | 5 | 3 |
| General use | 5s | 2s | 5 | 3 |
| High tolerance | 10s | 5s | 8 | 3 |
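What these settings mean in practice is easier to judge as reaction times. Starting from a window full of good probes, a backend turns sick after `window - threshold + 1` consecutive failures; starting from a window full of bad probes, it recovers after `threshold` consecutive successes. The sketch below applies that arithmetic to the three rows; it is an approximation that ignores the probe timeout and the offset of the first probe, and the function names are illustrative.

```python
def seconds_to_sick(interval, window, threshold):
    """Approximate time for a fully healthy backend to be marked sick."""
    failures_needed = window - threshold + 1
    return failures_needed * interval


def seconds_to_healthy(interval, threshold):
    """Approximate time for a fully sick backend to be marked healthy."""
    return threshold * interval


scenarios = {
    "High availability": (2, 5, 3),
    "General use": (5, 5, 3),
    "High tolerance": (10, 8, 3),
}
for name, (interval, window, threshold) in scenarios.items():
    down = seconds_to_sick(interval, window, threshold)
    up = seconds_to_healthy(interval, threshold)
    print(f"{name}: sick after ~{down}s, healthy after ~{up}s")
```

Under these assumptions the three profiles detect a dead backend after roughly 6, 15 and 60 seconds respectively.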
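9.3 Avoiding an Avalanche
vcl
# Keep probes cheap and the interval moderate so that the checks
# never become a load problem for the backends themselves.
# Note: .initial is a count of probes treated as good at startup, not a delay.
probe healthcheck {
    .url = "/healthcheck";
    .interval = 5s;
    .initial = 2;   # start with 2 probes counted as good
}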
10. Complete Configuration Example
vcl
vcl 4.1;
import directors;
import std;
# Probe definitions
probe web_probe {
    .url = "/healthcheck";
    .timeout = 2s;
    .interval = 5s;
    .window = 5;
    .threshold = 3;
    .expected_response = 200;
}
probe api_probe {
    .request =
        "GET /api/health HTTP/1.1"
        "Host: api.example.com"
        "Connection: close"
        "";
    .timeout = 5s;
    .interval = 10s;
    .window = 5;
    .threshold = 3;
}
# Backend definitions
backend web1 {
    .host = "192.168.1.10";
    .port = "80";
    .probe = web_probe;
    .max_connections = 500;
}
backend web2 {
    .host = "192.168.1.11";
    .port = "80";
    .probe = web_probe;
    .max_connections = 500;
}
backend web3 {
    .host = "192.168.1.12";
    .port = "80";
    .probe = web_probe;
    .max_connections = 500;
}
backend api1 {
    .host = "192.168.2.10";
    .port = "8080";
    .probe = api_probe;
    .max_connections = 300;
}
backend api2 {
    .host = "192.168.2.11";
    .port = "8080";
    .probe = api_probe;
    .max_connections = 300;
}
# Set up the load-balancing directors
sub vcl_init {
    new web_cluster = directors.round_robin();
    web_cluster.add_backend(web1);
    web_cluster.add_backend(web2);
    web_cluster.add_backend(web3);
    new api_cluster = directors.round_robin();
    api_cluster.add_backend(api1);
    api_cluster.add_backend(api2);
}
# Request handling
sub vcl_recv {
    # Route by URL; each director picks among its healthy backends
    if (req.url ~ "^/api/") {
        set req.backend_hint = api_cluster.backend();
    } else {
        set req.backend_hint = web_cluster.backend();
    }
}
# Backend error handling
sub vcl_backend_response {
    if (beresp.status >= 500 && bereq.retries < 3) {
        return (retry);
    }
}
sub vcl_backend_error {
    set beresp.status = 503;
    set beresp.http.Content-Type = "text/html; charset=utf-8";
    # With `vcl 4.1` the body is assigned through beresp.body
    # (the older synthetic() call belongs to VCL 4.0)
    set beresp.body = {"<!DOCTYPE html>
<html>
<head><title>Service Unavailable</title></head>
<body>
<h1>503 Service Unavailable</h1>
<p>The server is temporarily unavailable.</p>
</body>
</html>"};
    return (deliver);
}
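The configuration relies on the round-robin directors to keep sick backends out of rotation. The behaviour it depends on can be sketched as follows; this is an illustration of the idea, not the code of the `directors` VMOD, and the class name `RoundRobin` and the `healthy` mapping are invented for the example.

```python
class RoundRobin:
    """Rotate through backends, skipping any that are currently sick."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.next_index = 0

    def pick(self, healthy):
        # Look at each backend at most once per call.
        for _ in range(len(self.backends)):
            name = self.backends[self.next_index]
            self.next_index = (self.next_index + 1) % len(self.backends)
            if healthy.get(name, False):
                return name
        return None                      # every backend is sick


web_cluster = RoundRobin(["web1", "web2", "web3"])
state = {"web1": True, "web2": False, "web3": True}
print([web_cluster.pick(state) for _ in range(4)])
```

With `web2` marked sick, requests alternate between `web1` and `web3`; when no backend is healthy there is nothing to pick, which is the situation the `vcl_backend_error` page above is meant for.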
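11. Summary
This chapter covered:
- Health check overview: concept and how it works
- Basic configuration: a simple probe and the full parameter set
- Advanced configuration: custom requests, POST probes, authenticated probes
- Shared configuration: probe templates reused across backends
- State determination: sliding window and state transitions
- Manual management: setting state by hand and checking it in VCL
- Failover: the fallback director and retries
- Monitoring: statistics and the probe log
- Best practices: endpoint design and parameter tuning
With health checks covered, the next chapter turns to rate limiting.
Last updated: 2026-03-28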