Health Checks #

1. Health Check Overview #

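1.1 What Is a Health Check #

A health check is the Varnish feature that actively probes the state of backend servers. It is used to:

Detect whether a backend is available
Automatically take failed servers out of rotation
Automatically bring recovered servers back into rotation
Enable failover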

1.2 How It Works #

text

┌─────────────────────────────────────────────────────────┐
│                How a health check works                 │
├─────────────────────────────────────────────────────────┤
│                                                         │
│   Varnish ──► Probe Request ──► Backend Server          │
│      │                                                  │
│      │  ◄── Response ──┘                                │
│      │                                                  │
│      ├──► Success (200 OK) ──► marked Healthy           │
│      │                                                  │
│      └──► Failure (timeout/error) ──► marked Sick       │
│                                                         │
└─────────────────────────────────────────────────────────┘
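
The probe cycle in the diagram can be sketched outside Varnish. The Python function below only illustrates a single probe and is not how Varnish implements it; the name `probe_once` and its parameters are made up for this example. It sends one GET request and treats anything other than the expected status within the timeout as a failure.

```python
import http.client


def probe_once(host, port, url="/", timeout=2.0, expected_status=200):
    """Send one GET probe; return True only for the expected status in time."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", url, headers={"Connection": "close"})
        return conn.getresponse().status == expected_status
    except (OSError, http.client.HTTPException):
        # Refused connections, timeouts and malformed replies all count as failures.
        return False
    finally:
        conn.close()
```

A call such as `probe_once("192.168.1.10", 80, "/healthcheck")` returns a single good or bad result. Varnish then decides Healthy or Sick from several recent results rather than from one probe.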

2. Basic Configuration #

2.1 Minimal Health Check #

vcl

backend default {
    .host = "192.168.1.10";
    .port = "80";
    
    .probe = {
        .url = "/healthcheck";
    }
}

2.2 Full Configuration #

vcl

backend default {
    .host = "192.168.1.10";
    .port = "80";
    
    .probe = {
        .url = "/healthcheck";
        .timeout = 2s;
        .interval = 5s;
        .window = 5;
        .threshold = 3;
        .initial = 2;
        .expected_response = 200;
    }
}

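2.3 Parameter Reference #

Parameter	Default	Description
url	/	URL path to probe
timeout	2s	Probe timeout
interval	5s	Time between probes
window	8	Number of most recent probes considered (sliding window size)
threshold	3	Minimum number of good probes in the window for the backend to be healthy
initial	threshold - 1	Number of probes in the window assumed good when Varnish starts (a count, not a delay)
expected_response	200	Expected HTTP status code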

3. Advanced Configuration #

3.1 Custom Request #

vcl

backend default {
    .host = "192.168.1.10";
    .port = "80";
    
    .probe = {
        .request =
            "GET /healthcheck HTTP/1.1"
            "Host: example.com"
            "User-Agent: Varnish Health Check"
            "Connection: close"
            "";
        .timeout = 2s;
        .interval = 5s;
    }
}
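
According to the VCL reference, Varnish appends `\r\n` to every string in `.request` and then terminates the request with one more `\r\n`, so the trailing empty string in the example above only contributes an extra blank line. The Python sketch below reproduces that framing to show the bytes a backend would receive; `build_probe_request` is a hypothetical helper written for this illustration, not part of Varnish.

```python
def build_probe_request(strings):
    """Join probe strings with CRLF after each one, plus a final CRLF."""
    return ("".join(s + "\r\n" for s in strings) + "\r\n").encode("ascii")


raw = build_probe_request([
    "GET /healthcheck HTTP/1.1",
    "Host: example.com",
    "User-Agent: Varnish Health Check",
    "Connection: close",
])
print(raw.decode("ascii"))
```

The printed request ends with an empty line, which is what tells the backend that the header block is complete.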

3.2 POST Probe #

vcl

backend api {
    .host = "192.168.1.20";
    .port = "8080";
    
    .probe = {
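        # A basic VCL string cannot contain double quotes, so this example
        # sends a form-encoded body. Content-Length must match the body size.
        .request =
            "POST /api/health HTTP/1.1"
            "Host: api.example.com"
            "Content-Type: application/x-www-form-urlencoded"
            "Content-Length: 10"
            "Connection: close"
            ""
            "check=true";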
        .timeout = 5s;
        .interval = 10s;
    }
}
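
A hand-written `Content-Length` header is easy to get wrong, and a backend may reject or stall on a request whose declared length does not match the body. The value is the size of the body in bytes. The short Python sketch below computes it; `content_length` is a hypothetical helper name used only here.

```python
def content_length(body):
    """Return the Content-Length value for a body: its size in bytes."""
    return len(body.encode("utf-8"))


json_body = '{"check":"true"}'
form_body = "check=true"
print(content_length(json_body))
print(content_length(form_body))
```

For these two bodies the values are 16 and 10 respectively. Non-ASCII characters take more than one byte each, which is why the helper encodes the body before measuring it.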

3.3 Probe with Authentication #

vcl

backend default {
    .host = "192.168.1.10";
    .port = "80";
    
    .probe = {
        .request =
            "GET /healthcheck HTTP/1.1"
            "Host: example.com"
            "Authorization: Basic dXNlcjpwYXNz"
            "";
        .timeout = 2s;
        .interval = 5s;
    }
}
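
The value after `Basic` is the Base64 encoding of `username:password`. The Python sketch below shows how the header in the example can be produced; the helper name `basic_auth_header` and the credentials `user` / `pass` are placeholders for illustration.

```python
import base64


def basic_auth_header(user, password):
    """Build an HTTP Basic Authorization header line."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Authorization: Basic {token}"


print(basic_auth_header("user", "pass"))
```

Base64 is an encoding, not encryption, so credentials placed in a probe are readable by anyone who can read the VCL file or capture the traffic.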

4. Shared Probe Configuration #

4.1 Defining a Probe Template #

vcl

probe healthcheck {
    .url = "/healthcheck";
    .timeout = 2s;
    .interval = 5s;
    .window = 5;
    .threshold = 3;
}

backend server1 {
    .host = "192.168.1.10";
    .port = "80";
    .probe = healthcheck;
}

backend server2 {
    .host = "192.168.1.11";
    .port = "80";
    .probe = healthcheck;
}

backend server3 {
    .host = "192.168.1.12";
    .port = "80";
    .probe = healthcheck;
}

4.2 Different Probes for Different Backends #

vcl

probe web_probe {
    .url = "/healthcheck";
    .timeout = 2s;
    .interval = 5s;
}

probe api_probe {
    .url = "/api/health";
    .timeout = 5s;
    .interval = 10s;
}

backend web1 {
    .host = "192.168.1.10";
    .port = "80";
    .probe = web_probe;
}

backend api1 {
    .host = "192.168.2.10";
    .port = "8080";
    .probe = api_probe;
}

5. Determining Health State #

5.1 Sliding Window #

text

Probe results: [success, success, failure, success, success]

Window size (window): 5
Health threshold (threshold): 3

Successful probes among the last 5 >= 3 → Healthy
Successful probes among the last 5 < 3 → Sick
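
The rule above can be written as a few lines of Python. This is a simplified model of the window logic for illustration, not Varnish's own code; `health_state` is a hypothetical function name.

```python
from collections import deque


def health_state(results, window=5, threshold=3):
    """Classify a backend from its probe history (True = successful probe)."""
    recent = deque(results, maxlen=window)  # keeps only the last `window` results
    return "healthy" if sum(recent) >= threshold else "sick"


print(health_state([True, True, False, True, True]))
```

With the sequence from the example, four of the last five probes succeeded, so the backend counts as healthy. Older results fall out of the window and stop influencing the decision.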

5.2 State Transitions #

text

┌─────────────────────────────────────────────────────────┐
│                    State transitions                    │
├─────────────────────────────────────────────────────────┤
│                                                         │
│   Healthy ──► good probes < threshold ──► Sick          │
│                                                         │
│   Sick ──► good probes >= threshold ──► Healthy         │
│                                                         │
│   New backend ──► .initial probes assumed good          │
│                                                         │
└─────────────────────────────────────────────────────────┘
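
To see how `.initial` interacts with these transitions, the sketch below models a probe window in Python. It is a simplified illustration under two assumptions taken from the Varnish documentation: `.initial` is the number of probes treated as good at startup, and it defaults to `threshold - 1`. The class name `ProbeWindow` is made up for this example.

```python
from collections import deque


class ProbeWindow:
    """Simplified model of a probe's sliding window and health state."""

    def __init__(self, window=5, threshold=3, initial=None):
        # Assumption: .initial defaults to threshold - 1.
        if initial is None:
            initial = threshold - 1
        self.threshold = threshold
        # Pre-fill the window with `initial` probes that count as good.
        self.results = deque([True] * initial, maxlen=window)
        self.healthy = sum(self.results) >= threshold

    def record(self, ok):
        """Add one probe result; return a transition string or None."""
        self.results.append(ok)
        was_healthy = self.healthy
        self.healthy = sum(self.results) >= self.threshold
        if self.healthy and not was_healthy:
            return "sick -> healthy"
        if was_healthy and not self.healthy:
            return "healthy -> sick"
        return None
```

With the defaults in this model, a new backend starts sick but needs only one successful probe to become healthy. Passing `initial=0` makes it wait for `threshold` real successes.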

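5.3 Viewing Health State #

bash

# List the backends
varnishadm backend.list

# Example output (simplified; the exact columns vary by Varnish version)
name      ref   probe   health
server1   1     5/5     healthy
server2   1     3/5     healthy
server3   1     1/5     sick

# Show probe details
varnishadm backend.list -p

# Example output (simplified; the exact layout varies by Varnish version)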
server1:
  health: healthy
  probe: 5/5
  last check: 2 seconds ago
  last change: 5 minutes ago

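6. Managing Health State Manually #

6.1 Setting the Health State #

bash

# Force the backend healthy
varnishadm backend.set_health server1 healthy

# Force the backend sick
varnishadm backend.set_health server1 sick

# Return to automatic mode (follow the probe results)
varnishadm backend.set_health server1 auto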

6.2 Checking Health in VCL #

vcl

import std;

sub vcl_recv {
    # Check the health of the selected backend
    if (!std.healthy(req.backend_hint)) {
        # The backend is sick, so answer with an error
        return (synth(503, "Backend Unavailable"));
    }
}

7. Failover #

7.1 Using the Fallback Director #

vcl

import directors;

backend primary {
    .host = "192.168.1.10";
    .port = "80";
    .probe = healthcheck;
}

backend secondary {
    .host = "192.168.1.11";
    .port = "80";
    .probe = healthcheck;
}

sub vcl_init {
    new cluster = directors.fallback();
    cluster.add_backend(primary);
    cluster.add_backend(secondary);
}

sub vcl_recv {
    set req.backend_hint = cluster.backend();
}
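
The fallback director hands out the first healthy backend in the order the backends were added. The Python sketch below models that selection rule only; `pick_fallback` and the set of healthy names are hypothetical stand-ins for the director and the probe results.

```python
def pick_fallback(backends, healthy):
    """Return the first backend (in priority order) that is healthy, or None."""
    for name in backends:
        if name in healthy:
            return name
    return None


order = ["primary", "secondary"]
print(pick_fallback(order, {"primary", "secondary"}))
print(pick_fallback(order, {"secondary"}))
```

The first call selects `primary` and the second falls back to `secondary`. When no backend is healthy the function returns `None`, which corresponds to Varnish having no backend to fetch from.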

7.2 Custom Failover #

vcl

import std;

backend primary {
    .host = "192.168.1.10";
    .port = "80";
    .probe = healthcheck;
}

backend secondary {
    .host = "192.168.1.11";
    .port = "80";
    .probe = healthcheck;
}

sub vcl_recv {
    if (std.healthy(primary)) {
        set req.backend_hint = primary;
    } else {
        set req.backend_hint = secondary;
    }
}

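7.3 Retries #

vcl

sub vcl_backend_response {
    # Retry when the backend answers with a server error
    if (beresp.status >= 500 && bereq.retries < 3) {
        return (retry);
    }
}

sub vcl_backend_error {
    # No usable response: switch to the secondary backend and retry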
    if (bereq.retries < 3) {
        set bereq.backend = secondary;
        return (retry);
    }
}
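
Because `bereq.retries` counts retries already made, the condition `bereq.retries < 3` allows up to three retries, so four fetch attempts in total. The Python sketch below models only the `vcl_backend_response` branch to make that count explicit; `fetch_with_retries` and the `fetch` callback are hypothetical names, and the case where no response arrives at all is not modeled.

```python
def fetch_with_retries(fetch, max_retries=3):
    """Call fetch(retries) until it returns a non-5xx status or retries run out."""
    retries = 0
    while True:
        status = fetch(retries)
        if status >= 500 and retries < max_retries:
            retries += 1  # mirrors bereq.retries increasing on each retry
            continue
        return status, retries


attempts = []


def always_failing(retries):
    attempts.append(retries)
    return 503


print(fetch_with_retries(always_failing), len(attempts))
```

With a backend that always answers 503, the loop ends after three retries and four attempts, and the last 503 is what gets delivered. Retrying non-idempotent requests can repeat side effects on the backend, so the condition may need to be narrowed for such requests.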

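8. Monitoring Health Checks #

8.1 Statistics #

bash

# Show the backend connection counters
# (quote the pattern so the shell does not expand it)
varnishstat -1 -f 'MAIN.backend_*'

# Example output (simplified)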
MAIN.backend_conn                1234         Backend connections
MAIN.backend_unhealthy           5            Backend unhealthy events
MAIN.backend_busy                10           Backend busy events
MAIN.backend_fail                3            Backend failures
MAIN.backend_reuse               5678         Backend connection reuse
MAIN.backend_recycle             5678         Backend connection recycle
MAIN.backend_toolate             0            Backend connection too late
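
For alerting, the counter lines can be turned into numbers. The Python sketch below parses text shaped like the example output above, taking the first field as the counter name and the second as its value; `parse_counters` is a hypothetical helper, and the sample lines are copied from the example rather than from a live server.

```python
def parse_counters(text):
    """Map counter names to integer values from `varnishstat -1` style lines."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].startswith("MAIN.") and fields[1].isdigit():
            counters[fields[0]] = int(fields[1])
    return counters


sample = """\
MAIN.backend_conn                1234         Backend connections
MAIN.backend_unhealthy           5            Backend unhealthy events
MAIN.backend_fail                3            Backend failures
"""
counters = parse_counters(sample)
print(counters["MAIN.backend_fail"])
```

These counters are cumulative, so a monitoring job would normally compare two readings taken some time apart rather than alert on the absolute value.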

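8.2 Probe Log #

bash

# Probe results are logged under the Backend_health tag, outside any
# request transaction, so use raw grouping
varnishlog -g raw -i Backend_health

# Example output (simplified; real records also carry probe flags, counters and timings)
-   Backend_health default healthy 5/5
-   Backend_health server1 healthy 5/5
-   Backend_health server2 sick 1/5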

8.3 Monitoring Script #

bash

#!/bin/bash
# health_monitor.sh

echo "=== Backend Health Status ==="
echo ""

varnishadm backend.list -p

echo ""
echo "=== Health Check Statistics ==="
varnishstat -1 -f 'MAIN.backend_*'

echo ""
echo "=== Recent Health Changes ==="
varnishlog -d -g raw -i Backend_health | tail -20

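9. Health Check Best Practices #

9.1 Designing the Probe Endpoint #

A health check endpoint should verify the dependencies the service needs:

python

# A good health check endpoint
@app.route('/healthcheck')
def healthcheck():
    # Check the database connection
    if not database.is_connected():
        return 'Database unavailable', 503
    
    # Check the cache connection
    if not cache.is_connected():
        return 'Cache unavailable', 503
    
    # Check critical services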
    if not critical_service.is_available():
        return 'Service unavailable', 503
    
    return 'OK', 200

Avoid:

python

# Bad example: always returns 200
@app.route('/healthcheck')
def healthcheck():
    return 'OK', 200

# Bad example: runs an expensive operation on every probe
@app.route('/healthcheck')
def healthcheck():
    # Run a full set of diagnostics
    run_full_diagnostics()
    return 'OK', 200
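
One way to check real dependencies while keeping the endpoint cheap is to cache the result for a few seconds, so that frequent probes do not each trigger the underlying checks. The sketch below is a framework-independent illustration; `CachedCheck`, its `ttl` default and the injectable `clock` are assumptions made for this example.

```python
import time


class CachedCheck:
    """Wrap a dependency check and reuse its result for `ttl` seconds."""

    def __init__(self, check, ttl=5.0, clock=time.monotonic):
        self._check = check
        self._ttl = ttl
        self._clock = clock
        self._expires = None  # no cached result yet
        self._value = False

    def __call__(self):
        now = self._clock()
        if self._expires is None or now >= self._expires:
            # Cache is empty or stale: run the real check once.
            self._value = bool(self._check())
            self._expires = now + self._ttl
        return self._value
```

An endpoint could then return 200 when something like `dependencies_ok()` is true and 503 otherwise, where `dependencies_ok = CachedCheck(run_checks)` and `run_checks` is the application's own function. The TTL should stay shorter than the time the probe needs to detect a failure, otherwise a stale "OK" delays detection.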

9.2 Parameter Tuning #

Scenario	interval	timeout	window	threshold
High availability	2s	1s	5	3
General	5s	2s	5	3
High tolerance	10s	5s	8	3
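
These settings determine how quickly Varnish reacts. As a rough estimate that ignores probe timeouts and where in the interval the failure happens: starting from a window full of good probes, a backend turns sick after `window - threshold + 1` consecutive failures, and starting from a window full of failures it recovers after `threshold` consecutive successes. The Python sketch below applies this to the three rows; the function names are made up for the example.

```python
def seconds_until_sick(interval, window, threshold):
    """Approximate time for a fully healthy backend to be marked sick."""
    failures_needed = window - threshold + 1
    return failures_needed * interval


def seconds_until_healthy(interval, threshold):
    """Approximate time for a fully failed backend to be marked healthy again."""
    return threshold * interval


profiles = {
    "high availability": (2, 5, 3),
    "general": (5, 5, 3),
    "high tolerance": (10, 8, 3),
}
for name, (interval, window, threshold) in profiles.items():
    print(name,
          seconds_until_sick(interval, window, threshold),
          seconds_until_healthy(interval, threshold))
```

Under these assumptions the three profiles detect an outage after about 6, 15 and 60 seconds, and recover after about 6, 15 and 30 seconds.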

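9.3 Avoiding Cascading Failures #

vcl

# .initial is a probe count, not a delay: it sets how many probes in the
# window are assumed good when the VCL is loaded (default: threshold - 1).
# With 0, a backend starts out sick and only receives traffic once
# .threshold real probes have succeeded.
probe healthcheck {
    .url = "/healthcheck";
    .interval = 5s;
    .initial = 0;
}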

10. Complete Configuration Example #

vcl

vcl 4.1;

import directors;
import std;

# Probe definitions
probe web_probe {
    .url = "/healthcheck";
    .timeout = 2s;
    .interval = 5s;
    .window = 5;
    .threshold = 3;
    .expected_response = 200;
}

probe api_probe {
    .request =
        "GET /api/health HTTP/1.1"
        "Host: api.example.com"
        "Connection: close"
        "";
    .timeout = 5s;
    .interval = 10s;
    .window = 5;
    .threshold = 3;
}

# Backend definitions
backend web1 {
    .host = "192.168.1.10";
    .port = "80";
    .probe = web_probe;
    .max_connections = 500;
}

backend web2 {
    .host = "192.168.1.11";
    .port = "80";
    .probe = web_probe;
    .max_connections = 500;
}

backend web3 {
    .host = "192.168.1.12";
    .port = "80";
    .probe = web_probe;
    .max_connections = 500;
}

backend api1 {
    .host = "192.168.2.10";
    .port = "8080";
    .probe = api_probe;
    .max_connections = 300;
}

backend api2 {
    .host = "192.168.2.11";
    .port = "8080";
    .probe = api_probe;
    .max_connections = 300;
}

# Set up the load-balancing directors
sub vcl_init {
    new web_cluster = directors.round_robin();
    web_cluster.add_backend(web1);
    web_cluster.add_backend(web2);
    web_cluster.add_backend(web3);
    
    new api_cluster = directors.round_robin();
    api_cluster.add_backend(api1);
    api_cluster.add_backend(api2);
}

# Request handling
sub vcl_recv {
    # Route by URL; each director only hands out healthy backends
    if (req.url ~ "^/api/") {
        set req.backend_hint = api_cluster.backend();
    } else {
        set req.backend_hint = web_cluster.backend();
    }
}

# Backend error handling
sub vcl_backend_response {
    if (beresp.status >= 500 && bereq.retries < 3) {
        return (retry);
    }
}

sub vcl_backend_error {
    set beresp.status = 503;
    set beresp.http.Content-Type = "text/html; charset=utf-8";
    # With VCL 4.1 the synthetic body is set through beresp.body
    set beresp.body = {"<!DOCTYPE html>
<html>
<head><title>Service Unavailable</title></head>
<body>
<h1>503 Service Unavailable</h1>
<p>The server is temporarily unavailable.</p>
</body>
</html>"};
    return (deliver);
}

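11. Summary #

In this chapter we covered:

Health check overview: the concept and how it works
Basic configuration: the minimal setup and the full parameter set
Advanced configuration: custom requests, POST probes, authenticated probes
Shared configuration: probe templates reused across backends
Health state: the sliding window and state transitions
Manual management: setting the health state and checking it in VCL
Failover: the fallback director and retries
Monitoring: statistics and probe logs
Best practices: endpoint design and parameter tuning

With health checks covered, the next chapter moves on to rate limiting.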