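Nginx Health Checks

1. Health Check Overview

1.1 Why Health Checks Are Needed

Detect failed servers promptly
Automatically remove unavailable nodes from rotation
Improve service availability
Shorten the outage time that users notice

1.2 Types of Health Checks

Type	Description	Nginx edition
Passive	Infers server health from the outcome of real client requests	Open source
Active	Probes servers periodically, independent of client traffic	NGINX Plus (commercial) / third-party modules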

2. Passive Health Checks

2.1 Basic Configuration

nginx

upstream backend {
    server 192.168.1.10:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.11:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.12:8080 max_fails=3 fail_timeout=30s;
}

server {
    location / {
        proxy_pass http://backend;
    }
}

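2.2 Parameters

Parameter	Description	Default
max_fails	Number of failed attempts within fail_timeout after which the server is considered unavailable; 0 disables the accounting	1
fail_timeout	Both the window in which failures are counted and the period for which the server is then considered unavailable	10s

2.3 How It Works

Each failed request to a server increments that server's failure count.
When the count reaches max_fails within fail_timeout, the server is marked unavailable.
After fail_timeout has elapsed, Nginx sends the server a live request again; if it succeeds, the server returns to rotation.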
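
The steps above can be sketched as a small state model. This is a simplified illustration, not Nginx's actual implementation: the class name `PassivePeer` and its methods are hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PassivePeer:
    """Illustrative model of one upstream server under passive checking."""
    max_fails: int = 1            # same default as the max_fails parameter
    fail_timeout: float = 10.0    # seconds, same default as fail_timeout
    fails: int = 0
    window_start: float = 0.0     # time of the first failure in the current window
    down_since: Optional[float] = None

    def record_failure(self, now: float) -> None:
        # Failures only accumulate inside one fail_timeout window.
        if self.fails == 0 or now - self.window_start > self.fail_timeout:
            self.fails = 0
            self.window_start = now
        self.fails += 1
        # max_fails=0 disables the accounting entirely.
        if self.max_fails > 0 and self.fails >= self.max_fails:
            self.down_since = now

    def record_success(self) -> None:
        # A successful request clears the failure history.
        self.fails = 0
        self.down_since = None

    def is_available(self, now: float) -> bool:
        if self.down_since is None:
            return True
        # Once fail_timeout has elapsed, the server is tried again.
        return now - self.down_since >= self.fail_timeout


if __name__ == "__main__":
    peer = PassivePeer(max_fails=3, fail_timeout=30)
    for t in (0, 1, 2):
        peer.record_failure(t)
    print(peer.is_available(10), peer.is_available(32))
```

With max_fails=3 and fail_timeout=30s, three failures at t=0, 1, and 2 take the server out of rotation until t=32.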

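2.4 What Counts as a Failure

For the max_fails counter, the following always count as failed attempts:

Connection errors
Timeouts
Invalid responses from the server (invalid_header)

Responses with status 500, 502, 503, 504, or 429 count as failures only when the matching http_* value is listed in proxy_next_upstream, as in the next section. Status 403 and 404 never count as failures.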

2.5 Customizing Failure Conditions

nginx

location / {
    proxy_pass http://backend;
    proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
    proxy_next_upstream_tries 3;
    proxy_next_upstream_timeout 10s;
}

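2.6 proxy_next_upstream Parameters

Parameter	Description
error	An error occurred while connecting to the server, sending the request, or reading the response header
timeout	A timeout occurred at any of those stages
invalid_header	The server returned an empty or invalid response
http_500	The server returned status 500
http_502	The server returned status 502
http_503	The server returned status 503
http_504	The server returned status 504
non_idempotent	Also retry requests with non-idempotent methods (POST, LOCK, PATCH), which are otherwise not retried once they have been sent upstream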
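
The following sketch shows how these parameters combine into two separate decisions: whether an outcome counts toward max_fails, and whether the request may be passed to the next server. It is a reading of the documented rules, not Nginx source code; the function names and the outcome labels used as strings are hypothetical.

```python
from typing import Set

# Outcomes that always count as failed attempts, even when not configured.
ALWAYS_FAILED = {"error", "timeout", "invalid_header"}
# Outcomes that never count as failed attempts.
NEVER_FAILED = {"http_403", "http_404"}
# Methods that are not retried after being sent, unless non_idempotent is set.
NON_IDEMPOTENT_METHODS = {"POST", "LOCK", "PATCH"}


def counts_as_failed_attempt(outcome: str, configured: Set[str]) -> bool:
    """Does this outcome increment the server's failure counter?"""
    if outcome in ALWAYS_FAILED:
        return True
    if outcome in NEVER_FAILED:
        return False
    # http_500/502/503/504/429 count only when listed in proxy_next_upstream.
    return outcome in configured


def may_try_next_server(outcome: str, method: str, configured: Set[str],
                        request_sent: bool = True) -> bool:
    """May the request be passed to the next upstream server?"""
    if outcome not in configured:
        return False
    if (request_sent and method.upper() in NON_IDEMPOTENT_METHODS
            and "non_idempotent" not in configured):
        return False
    return True


if __name__ == "__main__":
    cfg = {"error", "timeout", "http_502"}
    print(counts_as_failed_attempt("http_502", cfg))
    print(may_try_next_server("error", "POST", cfg))
```

Keeping the two decisions apart makes one consequence visible: a POST that hits a connection error still counts against the server, but is not replayed elsewhere unless non_idempotent is configured.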

3. Active Health Checks (NGINX Plus)

3.1 Basic Configuration

nginx

upstream backend {
    zone backend 64k;
    
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080;
}

server {
    location / {
        proxy_pass http://backend;
        health_check interval=5s fails=3 passes=2;
    }
}

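3.2 Parameters

Parameter	Description	Default
interval	Time between two consecutive checks	5s
fails	Consecutive failed checks before the server is considered unhealthy	1
passes	Consecutive passed checks before the server is considered healthy again	1
uri	URI requested by the check	/
port	Port to connect to for the check	The server's own port
match	Name of the match block that a response must satisfy	Any 2xx or 3xx status passes
type	Protocol to check instead of plain HTTP; the http module accepts grpc	Not set (HTTP)

The http health_check directive takes no timeout parameter. In the stream module, the separate health_check_timeout directive (default 5s) limits how long a check may take.

3.3 Custom Check Path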

nginx

health_check uri=/health interval=5s fails=3 passes=2;

3.4 Matching the Response

nginx

match server_ok {
    status 200-399;
    header Content-Type = text/html;
    body ~ "OK";
}

server {
    location / {
        proxy_pass http://backend;
        health_check match=server_ok;
    }
}

3.5 The match Directive

nginx

match server_ok {
    status 200;
    status ! 500-599;
    header Content-Type ~ text;
    body !~ "error";
}
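
To make the semantics of a match block concrete, here is a sketch that evaluates three conditions of the kind shown above: status equal to 200, a Content-Type header matching the regular expression `text`, and a body that does not match `error`. The function name `matches_server_ok` is hypothetical. Limiting the body test to the first 256 bytes follows the documented behaviour of the match block.

```python
import re
from typing import Mapping


def matches_server_ok(status: int, headers: Mapping[str, str], body: bytes) -> bool:
    """Return True only if every condition holds, as in a match block."""
    # Header names are case-insensitive.
    content_type = next(
        (value for name, value in headers.items() if name.lower() == "content-type"),
        None,
    )
    status_ok = status == 200
    header_ok = content_type is not None and re.search(r"text", content_type) is not None
    # Only the beginning of the body is examined.
    body_ok = re.search(rb"error", body[:256]) is None
    return status_ok and header_ok and body_ok


if __name__ == "__main__":
    print(matches_server_ok(200, {"Content-Type": "text/html"}, b"OK"))
    print(matches_server_ok(200, {"Content-Type": "text/html"}, b"internal error"))
```

All conditions are combined with a logical AND, so one failing condition is enough to fail the health check.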

3.6 TCP Health Checks

nginx

stream {
    upstream mysql {
        zone mysql 64k;
        server 192.168.1.10:3306;
        server 192.168.1.11:3306;
    }
    
    server {
        listen 3306;
        proxy_pass mysql;
        health_check interval=5s;
    }
}
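
Without a match block, a stream health check essentially tests whether a TCP connection can be established. The sketch below reproduces that probe so a backend can be tested by hand. The function name `tcp_probe` and the 1-second timeout are assumptions for illustration.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


if __name__ == "__main__":
    for backend in ("192.168.1.10", "192.168.1.11"):
        state = "up" if tcp_probe(backend, 3306) else "down"
        print(f"{backend}:3306 is {state}")
```

A successful connect only proves that the port accepts connections; it says nothing about whether MySQL can serve queries.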

4. Third-Party Health Check Modules

4.1 nginx_upstream_check_module

Installation:

bash

cd /usr/local/src
git clone https://github.com/yaoweibin/nginx_upstream_check_module.git

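# Run from an unpacked nginx source tree (nginx-1.24.0 in this example).
cd nginx-1.24.0
# Apply the patch that matches your nginx version; confirm the file exists first:
#   ls /usr/local/src/nginx_upstream_check_module/*.patch
patch -p1 < /usr/local/src/nginx_upstream_check_module/check_1.24.0+.patch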

./configure --add-module=/usr/local/src/nginx_upstream_check_module
make && make install

4.2 Basic Configuration

nginx

upstream backend {
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080;
    
    check interval=3000 rise=2 fall=3 timeout=1000 type=http;
    check_http_send "GET /health HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}

server {
    location / {
        proxy_pass http://backend;
    }
    
    location /status {
        check_status;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
}

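4.3 Parameters

Parameter	Description
interval	Time between checks, in milliseconds
rise	Consecutive successful checks before the server is marked up
fall	Consecutive failed checks before the server is marked down
timeout	Timeout for a single check, in milliseconds
type	Check type (tcp/http/ssl_hello/mysql/ajp)

4.4 Check Types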

nginx

check interval=3000 rise=2 fall=3 timeout=1000 type=tcp;
check interval=3000 rise=2 fall=3 timeout=1000 type=http;
check interval=3000 rise=2 fall=3 timeout=1000 type=ssl_hello;
check interval=3000 rise=2 fall=3 timeout=1000 type=mysql;
check interval=3000 rise=2 fall=3 timeout=1000 type=ajp;

4.5 HTTP Check Configuration

nginx

check interval=3000 rise=2 fall=3 timeout=1000 type=http;
check_http_send "GET /health HTTP/1.0\r\nHost: backend\r\n\r\n";
check_http_expect_alive http_2xx http_3xx;

4.6 Status Page

nginx

location /nginx_status {
    check_status;
    access_log off;
    allow 127.0.0.1;
    deny all;
}

Requesting /nginx_status shows the health status of each upstream server.

5. OpenResty Health Checks

5.1 lua-resty-healthcheck

Installation:

bash

luarocks install lua-resty-healthcheck

5.2 Configuration Example

nginx

lua_shared_dict healthcheck 1m;

init_worker_by_lua_block {
    local healthcheck = require "resty.healthcheck"
    local checker = healthcheck.new({
        name = "backend",
        shm_name = "healthcheck",
        checks = {
            active = {
                http_path = "/health",
                healthy = {
                    interval = 5,
                    successes = 2
                },
                unhealthy = {
                    interval = 5,
                    http_failures = 3
                }
            },
            passive = {
                healthy = {
                    http_statuses = { 200, 201, 202 },
                    successes = 2
                },
                unhealthy = {
                    http_statuses = { 500, 502, 503, 504 },
                    http_failures = 3
                }
            }
        }
    })
    
    checker:add_target("192.168.1.10", 8080)
    checker:add_target("192.168.1.11", 8080)
    
    package.loaded.checker = checker
}

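server {
    location / {
        # A user-defined variable; $upstream_addr is built in and read-only.
        set $backend_addr "";

        access_by_lua_block {
            local checker = package.loaded.checker
            -- Same targets as registered in init_worker_by_lua_block.
            local targets = {
                { "192.168.1.10", 8080 },
                { "192.168.1.11", 8080 },
            }

            -- Use the first target the checker currently reports as healthy.
            for _, t in ipairs(targets) do
                if checker:get_target_status(t[1], t[2]) then
                    ngx.var.backend_addr = t[1] .. ":" .. t[2]
                    return
                end
            end

            ngx.exit(503)
        }

        proxy_pass http://$backend_addr;
    }
}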

6. Custom Health Check Endpoints

6.1 Application Endpoint

The backend application should expose a health check endpoint:

python

import time

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/health')
def health():
    return jsonify({
        'status': 'healthy',
        'timestamp': time.time()
    }), 200

6.2 Detailed Health Check

python

@app.route('/health/detail')
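def health_detail():
    # check_database(), check_redis() and check_disk() are application-specific
    # helpers that return True when the dependency is usable.
    checks = {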
        'database': check_database(),
        'redis': check_redis(),
        'disk': check_disk()
    }
    
    all_healthy = all(checks.values())
    status = 200 if all_healthy else 503
    
    return jsonify({
        'status': 'healthy' if all_healthy else 'unhealthy',
        'checks': checks
    }), status
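
The endpoint above calls helper functions that it does not define. As one possible shape for them, the sketch below implements a disk check with the standard library, plus a wrapper that turns any exception into an unhealthy result so the endpoint itself never crashes. The names `check_disk` and `run_check` and the 10% free-space threshold are assumptions, not part of any framework.

```python
import shutil
from typing import Callable


def check_disk(path: str = "/", min_free_ratio: float = 0.10) -> bool:
    """True when at least min_free_ratio of the filesystem at path is free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return False
    return usage.free / usage.total >= min_free_ratio


def run_check(check: Callable[[], bool]) -> bool:
    """Run one check and treat any exception as an unhealthy result."""
    try:
        return bool(check())
    except Exception:
        return False


if __name__ == "__main__":
    checks = {"disk": run_check(check_disk)}
    print(checks)
```

Wrapping each helper in `run_check` means a failing dependency yields a 503 with a readable report, rather than an unhandled exception and a 500.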

6.3 Nginx Health Check Configuration

nginx

upstream backend {
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    
    check interval=5000 rise=2 fall=3 timeout=1000 type=http;
    check_http_send "GET /health HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx;
}

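7. Monitoring and Alerting

7.1 Prometheus Monitoring

Expose the status endpoints below; a Prometheus exporter, or the alert script in 7.2, can then poll them.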

nginx

location /nginx_status {
    stub_status on;
    access_log off;
    allow 127.0.0.1;
    deny all;
}

location /upstream_status {
    check_status json;
    access_log off;
    allow 127.0.0.1;
    deny all;
}

7.2 Alert Script

bash

#!/bin/bash

STATUS=$(curl -s http://localhost/upstream_status)
# check_status nests the per-server list under .servers.server
UNHEALTHY=$(echo "$STATUS" | jq '.servers.server | map(select(.status != "up")) | length')

if [ "$UNHEALTHY" -gt 0 ]; then
    echo "Alert: $UNHEALTHY servers are unhealthy"
    # Send the alert notification here
fi
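
The same check can be written in Python when the alerting logic needs to grow beyond a one-liner. This sketch assumes the status JSON has a `servers` object holding a `server` array whose entries carry `name` and `status`. That layout is an assumption about the module's output and should be verified against a real /upstream_status response. The sample payload is illustrative, not captured from a server.

```python
import json
from typing import List


def unhealthy_servers(payload: str) -> List[str]:
    """Return the names of servers whose status is not "up"."""
    data = json.loads(payload)
    # Assumed layout: {"servers": {"server": [{"name": ..., "status": ...}, ...]}}
    servers = data.get("servers", {}).get("server", [])
    return [s.get("name", "?") for s in servers if s.get("status") != "up"]


# Illustrative payload, not real output.
SAMPLE = json.dumps({
    "servers": {
        "total": 2,
        "server": [
            {"name": "192.168.1.10:8080", "status": "up"},
            {"name": "192.168.1.11:8080", "status": "down"},
        ],
    }
})

if __name__ == "__main__":
    down = unhealthy_servers(SAMPLE)
    if down:
        print(f"Alert: {len(down)} servers are unhealthy: {', '.join(down)}")
```

Returning the server names rather than just a count lets the notification say which backend needs attention.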

7.3 Scheduled Checks

bash

* * * * * /usr/local/bin/check_nginx_health.sh

8. Failure Recovery

8.1 Automatic Recovery

The health check module brings a recovered server back into rotation automatically:

nginx

check interval=3000 rise=2 fall=3;

After 2 consecutive successful checks (rise=2), the server is marked up again.

8.2 Manual Control

Mark a server as offline:

nginx

upstream backend {
    server 192.168.1.10:8080 down;
    server 192.168.1.11:8080;
}

Use a backup server, which receives traffic only when the primary servers are unavailable:

nginx

upstream backend {
    server 192.168.1.10:8080;
    server 192.168.1.11:8080 backup;
}

8.3 Graceful Removal

nginx

location /admin/downstream {
    allow 127.0.0.1;
    deny all;

    content_by_lua_block {
        local checker = package.loaded.checker
        -- Arguments: ip, port, hostname (nil here), is_healthy
        checker:set_target_status("192.168.1.10", 8080, nil, false)
        ngx.say("Server marked as down")
    }
}

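9. Best Practices

9.1 Check Interval

Scenario	Suggested interval
High-availability services	2-5 seconds
Regular services	5-10 seconds
Low-priority services	10-30 seconds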
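
The interval matters mostly through how long a dead server keeps receiving traffic. As a rough estimate, assume probes start every `interval` regardless of how long each one takes, and that a dead server makes each probe run into its timeout. Under those assumptions the worst-case detection time is `fall × interval + timeout`. The function names below are hypothetical.

```python
def worst_case_detection_ms(interval_ms: int, fall: int, timeout_ms: int) -> int:
    """Server dies just after a successful probe; the fall-th failed probe
    starts fall * interval later and needs up to timeout to fail."""
    return fall * interval_ms + timeout_ms


def best_case_detection_ms(interval_ms: int, fall: int) -> int:
    """Server dies just before a probe and probes fail immediately
    (for example, connection refused)."""
    return (fall - 1) * interval_ms


if __name__ == "__main__":
    # Values used throughout this chapter: interval=3000 fall=3 timeout=1000
    print(worst_case_detection_ms(3000, 3, 1000))  # 10000
    print(best_case_detection_ms(3000, 3))         # 6000
```

With interval=3000, fall=3, and timeout=1000, a failed server is detected after roughly 6 to 10 seconds. Halving the interval halves most of that window at the cost of twice as many probes.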

9.2 Failure and Success Thresholds

nginx

check interval=3000 rise=2 fall=3;

rise=2: the server is restored after 2 consecutive successes
fall=3: the server is taken offline after 3 consecutive failures
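
The two thresholds form a small state machine: only an unbroken run of contradicting results flips the state, which stops a single lost probe from ejecting a server. This is a minimal sketch of that logic. The class name `ActiveCheckState` is hypothetical and the code is not taken from the module.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ActiveCheckState:
    rise: int = 2          # consecutive successes needed to come back up
    fall: int = 3          # consecutive failures needed to go down
    healthy: bool = True
    streak: int = 0        # consecutive results that contradict the current state

    def observe(self, probe_ok: bool) -> bool:
        """Feed one probe result; return the resulting health state."""
        if probe_ok == self.healthy:
            # Result agrees with the current state: the run is broken.
            self.streak = 0
        else:
            self.streak += 1
            threshold = self.fall if self.healthy else self.rise
            if self.streak >= threshold:
                self.healthy = not self.healthy
                self.streak = 0
        return self.healthy


def replay(results: List[bool], rise: int = 2, fall: int = 3) -> List[bool]:
    state = ActiveCheckState(rise=rise, fall=fall)
    return [state.observe(r) for r in results]


if __name__ == "__main__":
    probes = [False, False, True, False, False, False, True, True]
    print(replay(probes))
```

In the sample sequence, the first two failures are cancelled by the success that follows, the next three failures take the server down, and two successes bring it back.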

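9.3 Health Check Endpoint Requirements

Responds in under 100 ms
Returns status 200 when healthy
Does not depend on external services
Performs only lightweight checks

9.4 Check Request Content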

nginx

check_http_send "GET /health HTTP/1.0\r\nHost: backend\r\nConnection: close\r\n\r\n";

10. Complete Configuration Example

nginx

upstream backend {
    zone backend 64k;
    
    server 192.168.1.10:8080 weight=3;
    server 192.168.1.11:8080 weight=3;
    server 192.168.1.12:8080 weight=2 backup;
    
    check interval=3000 rise=2 fall=3 timeout=1000 type=http;
    check_http_send "GET /health HTTP/1.0\r\nHost: backend\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
    
    keepalive 32;
}

server {
    listen 80;
    server_name api.example.com;
    
    location / {
        proxy_pass http://backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        
        proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
        proxy_next_upstream_tries 3;
        proxy_next_upstream_timeout 10s;
    }
    
    location /nginx_status {
        stub_status on;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
    
    location /upstream_status {
        check_status json;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
}

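11. Summary

This chapter covered:

Passive health checks: max_fails and fail_timeout
Active health checks: health_check in NGINX Plus
Third-party module: nginx_upstream_check_module
OpenResty checks: lua-resty-healthcheck
Custom endpoints: implementing health checks in the application
Monitoring and alerting: Prometheus and an alert script
Failure recovery: automatic recovery and manual control
Best practices: check intervals and thresholds

With health checks covered, the next chapter moves on to troubleshooting.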