Nginx Health Checks

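1. Health Check Overview

1.1 Why Health Checks Are Needed

  • Detect failed servers quickly
  • Remove unavailable nodes automatically
  • Improve service availability
  • Shorten the outage time that users notice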

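1.2 Types of Health Checks

Type            Description                                  Nginx version
Passive check   Judges health from real client requests      Open source
Active check    Probes server status on its own schedule     NGINX Plus (commercial) / third-party modules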

2. Passive Health Checks

2.1 Basic Configuration

nginx
upstream backend {
    server 192.168.1.10:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.11:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.12:8080 max_fails=3 fail_timeout=30s;
}

server {
    location / {
        proxy_pass http://backend;
    }
}

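2.2 Parameters

Parameter      Description                                                          Default
max_fails      Failed attempts allowed within fail_timeout (0 disables counting)    1
fail_timeout   Counting window; also how long the server then stays unavailable     10s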

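2.3 How It Works

  1. Each failed request to a server increments that server's failure count.
  2. When the count reaches max_fails within the fail_timeout window, the server is marked unavailable.
  3. After fail_timeout has elapsed, nginx sends requests to the server again.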
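The three steps above can be sketched as a small state model. This is a simplified illustration, not nginx's actual implementation; the class name `PassiveServerState` and its methods are hypothetical, and only `max_fails` and `fail_timeout` come from the configuration above.

```python
from dataclasses import dataclass, field


@dataclass
class PassiveServerState:
    """Tracks failures for one upstream server (simplified model)."""
    max_fails: int = 1           # same default as nginx
    fail_timeout: float = 10.0   # seconds, same default as nginx
    failures: list = field(default_factory=list)  # timestamps of recent failures
    down_until: float = 0.0      # the server is skipped until this time

    def record_failure(self, now: float) -> None:
        # Keep only the failures that fall inside the fail_timeout window.
        self.failures = [t for t in self.failures if now - t < self.fail_timeout]
        self.failures.append(now)
        # max_fails=0 disables failure accounting entirely.
        if self.max_fails > 0 and len(self.failures) >= self.max_fails:
            # Threshold reached: skip the server for fail_timeout seconds.
            self.down_until = now + self.fail_timeout
            self.failures.clear()

    def is_available(self, now: float) -> bool:
        return now >= self.down_until


state = PassiveServerState(max_fails=3, fail_timeout=30.0)
for t in (0.0, 1.0, 2.0):
    state.record_failure(t)
print(state.is_available(5.0))   # False: third failure at t=2 took it out
print(state.is_available(32.0))  # True: fail_timeout has elapsed
```

Note that the server is not probed while it is out of rotation. The next real client request after the timeout is what tests it again, which is the main limitation of passive checks.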

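2.4 What Counts as a Failure

By default, the following are counted as failed attempts:

  • Connection errors
  • Timeouts
  • Invalid responses from the server (invalid_header)

Responses with status 500, 502, 503, or 504 count as failures only when they are listed in proxy_next_upstream.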

2.5 Customizing Failure Conditions

nginx
location / {
    proxy_pass http://backend;
    proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
    proxy_next_upstream_tries 3;
    proxy_next_upstream_timeout 10s;
}

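2.6 proxy_next_upstream Parameters

Parameter        Description
error            Error while connecting to the server, sending the request, or reading the response header
timeout          Timeout during any of those steps
invalid_header   The server returned an empty or invalid response
http_500         The server returned a 500 response
http_502         The server returned a 502 response
http_503         The server returned a 503 response
http_504         The server returned a 504 response
non_idempotent   Also retry non-idempotent requests (POST, LOCK, PATCH) that were already sent upstream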
3. Active Health Checks (NGINX Plus)

3.1 Basic Configuration

nginx
upstream backend {
    zone backend 64k;
    
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080;
}

server {
    location / {
        proxy_pass http://backend;
        health_check interval=5s fails=3 passes=2;
    }
}

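3.2 Parameters

Parameter   Description                                          Default
interval    Time between two consecutive checks                  5s
fails       Consecutive failed checks before marking unhealthy   1
passes      Consecutive passed checks before marking healthy     1
uri         URI requested by the check                           /
port        Port the check connects to                           The server's port
type        Probe protocol; set type=grpc for gRPC upstreams     Plain HTTP request

health_check takes no timeout parameter for HTTP upstreams; probes follow the location's proxy_connect_timeout and proxy_read_timeout. In the stream module the limit is set with the separate health_check_timeout directive.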
3.3 Custom Check URI

nginx
health_check uri=/health interval=5s fails=3 passes=2;

3.4 Matching the Response

nginx
match server_ok {
    status 200-399;
    header Content-Type = text/html;
    body ~ "OK";
}

server {
    location / {
        proxy_pass http://backend;
        health_check match=server_ok;
    }
}

3.5 The match Directive

nginx
match server_ok {
    status 200;
    status ! 500-599;
    header Content-Type ~ text;
    body !~ "error";
}
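To make the semantics of a match block concrete, here is a rough Python equivalent of a block that combines a status range, a header regex, and a negated body regex. It is an illustration only; the function name `matches` is hypothetical, and it assumes that every condition in the block must hold for the check to pass.

```python
import re


def matches(status: int, headers: dict, body: str) -> bool:
    """Rough equivalent of a match block containing:

        status 200-399;
        header Content-Type ~ text;
        body !~ "error";
    """
    status_ok = 200 <= status <= 399
    # HTTP header names are case-insensitive.
    content_type = next(
        (v for k, v in headers.items() if k.lower() == "content-type"), None
    )
    header_ok = content_type is not None and re.search("text", content_type) is not None
    body_ok = re.search("error", body) is None
    # All conditions must hold for the probe to pass.
    return status_ok and header_ok and body_ok


print(matches(200, {"Content-Type": "text/html"}, "OK"))              # True
print(matches(200, {"Content-Type": "text/html"}, "internal error"))  # False
```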

3.6 TCP Health Checks

nginx
stream {
    upstream mysql {
        zone mysql 64k;
        server 192.168.1.10:3306;
        server 192.168.1.11:3306;
    }
    
    server {
        listen 3306;
        proxy_pass mysql;
        health_check interval=5s;
    }
}
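Without a match block, a TCP check of this kind comes down to whether a connection can be established. A minimal sketch of such a probe, useful for testing a backend by hand, could look like this; `tcp_probe` is a hypothetical helper and the one-second timeout is an assumed value.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection can be established within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

For example, `tcp_probe("192.168.1.10", 3306)` reports whether the first MySQL server from the configuration above accepts connections. It says nothing about whether MySQL can actually serve queries.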

4. Third-Party Health Check Modules

4.1 nginx_upstream_check_module

Installation:

bash
cd /usr/local/src
git clone https://github.com/yaoweibin/nginx_upstream_check_module.git

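# Assumes the nginx source tree is unpacked at /usr/local/src/nginx-1.24.0
cd nginx-1.24.0
# Use the patch file shipped in the repository that matches your nginx version
patch -p1 < /usr/local/src/nginx_upstream_check_module/check_1.24.0+.patch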

./configure --add-module=/usr/local/src/nginx_upstream_check_module
make && make install

4.2 Basic Configuration

nginx
upstream backend {
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080;
    
    check interval=3000 rise=2 fall=3 timeout=1000 type=http;
    check_http_send "GET /health HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}

server {
    location / {
        proxy_pass http://backend;
    }
    
    location /status {
        check_status;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
}

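4.3 Parameters

Parameter   Description
interval    Time between checks (milliseconds)
rise        Consecutive successes before the server is marked up
fall        Consecutive failures before the server is marked down
timeout     Check timeout (milliseconds)
type        Check type (tcp/http/ssl_hello/mysql/ajp)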

4.4 Check Types

nginx
check interval=3000 rise=2 fall=3 timeout=1000 type=tcp;
check interval=3000 rise=2 fall=3 timeout=1000 type=http;
check interval=3000 rise=2 fall=3 timeout=1000 type=ssl_hello;
check interval=3000 rise=2 fall=3 timeout=1000 type=mysql;
check interval=3000 rise=2 fall=3 timeout=1000 type=ajp;

4.5 HTTP Check Configuration

nginx
check interval=3000 rise=2 fall=3 timeout=1000 type=http;
check_http_send "GET /health HTTP/1.0\r\nHost: backend\r\n\r\n";
check_http_expect_alive http_2xx http_3xx;
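The two HTTP directives describe a request to send and the status classes that count as alive. The sketch below mirrors that logic in Python for illustration; `build_probe_request` and `is_alive` are hypothetical helpers, not part of the module.

```python
def build_probe_request(path: str, host: str) -> bytes:
    """Build the raw HTTP/1.0 request, as in check_http_send."""
    return f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")


def is_alive(status_line: str, expect: tuple = ("http_2xx", "http_3xx")) -> bool:
    """Check a status line against check_http_expect_alive-style classes."""
    # A status line looks like "HTTP/1.1 200 OK".
    parts = status_line.split()
    if len(parts) < 2 or not parts[1].isdigit():
        return False
    status_class = f"http_{parts[1][0]}xx"
    return status_class in expect


print(build_probe_request("/health", "backend"))
print(is_alive("HTTP/1.1 200 OK"))                   # True
print(is_alive("HTTP/1.1 503 Service Unavailable"))  # False
```

The blank line at the end of the request (the final `\r\n\r\n`) is what terminates the headers; without it the backend keeps waiting and the probe times out.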

4.6 Status Page

nginx
location /nginx_status {
    check_status;
    access_log off;
    allow 127.0.0.1;
    deny all;
}

Requesting /nginx_status shows the health status of each upstream server.

5. OpenResty Health Checks

5.1 lua-resty-healthcheck

Installation:

bash
luarocks install lua-resty-healthcheck

5.2 Configuration Example

nginx
lua_shared_dict healthcheck 1m;

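init_worker_by_lua_block {
    -- The library's worker-events dependency must be configured before this
    -- point; the exact setup depends on the installed version (see its README).
    local healthcheck = require "resty.healthcheck"
    local checker = healthcheck.new({
        name = "backend",
        shm_name = "healthcheck",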
        checks = {
            active = {
                http_path = "/health",
                healthy = {
                    interval = 5,
                    successes = 2
                },
                unhealthy = {
                    interval = 5,
                    http_failures = 3
                }
            },
            passive = {
                healthy = {
                    http_statuses = { 200, 201, 202 },
                    successes = 2
                },
                unhealthy = {
                    http_statuses = { 500, 502, 503, 504 },
                    http_failures = 3
                }
            }
        }
    })
    
    checker:add_target("192.168.1.10", 8080)
    checker:add_target("192.168.1.11", 8080)
    
    package.loaded.checker = checker
}

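server {
    location / {
        # $upstream_addr is a built-in read-only variable, so use our own.
        set $backend_addr "";

        access_by_lua_block {
            local checker = package.loaded.checker
            local targets = { {"192.168.1.10", 8080}, {"192.168.1.11", 8080} }

            -- Pick the first target that the checker reports as healthy.
            for _, t in ipairs(targets) do
                if checker:get_target_status(t[1], t[2]) then
                    ngx.var.backend_addr = t[1] .. ":" .. t[2]
                    return
                end
            end

            ngx.exit(503)
        }

        proxy_pass http://$backend_addr;
    }
}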

6. Custom Health Check Endpoints

6.1 Application Endpoint

The backend application should expose a health check endpoint:

python
import time

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/health')
def health():
    return jsonify({
        'status': 'healthy',
        'timestamp': time.time()
    }), 200

6.2 Detailed Health Check

python
@app.route('/health/detail')
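def health_detail():
    # check_database, check_redis and check_disk are application-specific
    # helpers (not shown) that return True when the dependency is healthy.
    checks = {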
        'database': check_database(),
        'redis': check_redis(),
        'disk': check_disk()
    }
    
    all_healthy = all(checks.values())
    status = 200 if all_healthy else 503
    
    return jsonify({
        'status': 'healthy' if all_healthy else 'unhealthy',
        'checks': checks
    }), status
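The check_* helpers are left to the application. As one possible illustration, `check_disk` could be written with the standard library as below. The 10% free-space threshold and the default path are assumptions chosen for the example, not values from this chapter.

```python
import shutil


def check_disk(path: str = "/", min_free_ratio: float = 0.1) -> bool:
    """Return True if at least min_free_ratio of the disk at path is free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some virtual filesystems report a size of zero; treat as unhealthy.
        return False
    return usage.free / usage.total >= min_free_ratio
```

Each helper returns a plain boolean, which is what `all(checks.values())` in the endpoint relies on. A helper that returned a dict or a message string would always be truthy and would hide failures.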

6.3 Nginx Health Check Configuration

nginx
upstream backend {
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    
    check interval=5000 rise=2 fall=3 timeout=1000 type=http;
    check_http_send "GET /health HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx;
}

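7. Monitoring and Alerting

7.1 Prometheus Monitoring

The endpoints below only expose raw status data; Prometheus needs an exporter (for example nginx-prometheus-exporter for stub_status) to scrape them.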

nginx
location /nginx_status {
    stub_status on;
    access_log off;
    allow 127.0.0.1;
    deny all;
}

location /upstream_status {
    check_status json;
    access_log off;
    allow 127.0.0.1;
    deny all;
}

7.2 Alert Script

bash
#!/bin/bash

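STATUS=$(curl -s http://localhost/upstream_status)
# The module's JSON output lists the servers under .servers.server
UNHEALTHY=$(echo "$STATUS" | jq '.servers.server | map(select(.status != "up")) | length')

# Default to 0 so an empty result (e.g. curl failed) does not break the test
if [ "${UNHEALTHY:-0}" -gt 0 ]; then
    echo "Alert: $UNHEALTHY servers are unhealthy"
    # Send the alert notification here
fi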
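The same filtering can be done in Python when the alert logic needs to grow beyond a shell script. This sketch assumes the JSON layout `{"servers": {"server": [...]}}` with `name` and `status` fields per entry; verify that against the actual output of your module build. `unhealthy_servers` is a hypothetical helper name.

```python
import json


def unhealthy_servers(status_json: str) -> list:
    """Return the names of servers whose status is not "up"."""
    data = json.loads(status_json)
    # Assumed layout: {"servers": {"server": [{"name": ..., "status": ...}]}}
    servers = data.get("servers", {}).get("server", [])
    return [s.get("name", "?") for s in servers if s.get("status") != "up"]


# Hand-written sample input for illustration, not captured output.
sample = json.dumps({
    "servers": {
        "total": 2,
        "server": [
            {"name": "192.168.1.10:8080", "status": "up"},
            {"name": "192.168.1.11:8080", "status": "down"},
        ],
    }
})
print(unhealthy_servers(sample))  # ['192.168.1.11:8080']
```

Returning the names rather than just a count lets the alert message say which server is down.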

7.3 Scheduled Checks

bash
* * * * * /usr/local/bin/check_nginx_health.sh

8. Failure Recovery

8.1 Automatic Recovery

The health check module brings servers back into rotation automatically:

nginx
check interval=3000 rise=2 fall=3;

After 2 consecutive successful checks (rise=2), the server is restored.

8.2 Manual Control

Mark a server as offline:

nginx
upstream backend {
    server 192.168.1.10:8080 down;
    server 192.168.1.11:8080;
}

Use a backup server:

nginx
upstream backend {
    server 192.168.1.10:8080;
    server 192.168.1.11:8080 backup;
}
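How the down and backup flags affect server selection can be sketched as follows. This is a simplified model for illustration; the `Server` class and `candidates` function are hypothetical, and real load balancing also applies weights and the balancing method.

```python
from dataclasses import dataclass


@dataclass
class Server:
    addr: str
    down: bool = False     # marked "down" in the configuration
    backup: bool = False   # marked "backup" in the configuration
    healthy: bool = True   # current health check result


def candidates(servers: list) -> list:
    """Return the servers eligible to receive traffic."""
    usable = [s for s in servers if not s.down and s.healthy]
    primaries = [s for s in usable if not s.backup]
    # Backup servers are used only when no primary server is available.
    return primaries if primaries else [s for s in usable if s.backup]


primary = Server("192.168.1.10:8080")
backup = Server("192.168.1.11:8080", backup=True)
print([s.addr for s in candidates([primary, backup])])  # ['192.168.1.10:8080']
primary.healthy = False
print([s.addr for s in candidates([primary, backup])])  # ['192.168.1.11:8080']
```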

8.3 Graceful Removal

nginx
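location /admin/downstream {
    allow 127.0.0.1;
    deny all;

    content_by_lua_block {
        -- Uses the lua-resty-healthcheck checker stored in package.loaded
        local checker = package.loaded.checker
        -- Recent releases take (ip, port, hostname, is_healthy)
        local ok, err = checker:set_target_status("192.168.1.10", 8080, nil, false)
        if not ok then
            ngx.status = 500
            ngx.say("Failed to mark server as down: ", err)
            return
        end
        ngx.say("Server marked as down")
    }
}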

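9. Best Practices

9.1 Check Interval

Scenario                   Suggested interval
High-availability service  2-5 seconds
Regular service            5-10 seconds
Low-priority service       10-30 seconds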

9.2 Failure/Success Thresholds

nginx
check interval=3000 rise=2 fall=3;
  • rise=2: the server is restored after 2 consecutive successes
  • fall=3: the server is taken offline after 3 consecutive failures
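These thresholds, together with the interval, determine how long a failure can go unnoticed. The arithmetic below is a rough upper bound under the assumption that probes start every `interval` milliseconds; the function names are hypothetical.

```python
def worst_case_detection_ms(interval_ms: int, fall: int, timeout_ms: int) -> int:
    """Rough upper bound on the time needed to mark a failed server down.

    The server may fail right after a successful probe, so up to `fall`
    full intervals can pass before the last failing probe starts, and that
    probe may take up to `timeout_ms` before it gives up.
    """
    return interval_ms * fall + timeout_ms


def recovery_ms(interval_ms: int, rise: int) -> int:
    """Rough upper bound on the time needed to restore a recovered server."""
    return interval_ms * rise


print(worst_case_detection_ms(3000, 3, 1000))  # 10000
print(recovery_ms(3000, 2))                    # 6000
```

With interval=3000, fall=3 and timeout=1000, a dead server can keep receiving traffic for up to about 10 seconds. Lowering fall shortens this window but makes the check more sensitive to a single slow response.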

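9.3 Health Check Endpoint Requirements

  • Responds in under 100 ms
  • Returns status code 200
  • Does not depend on external services (keep dependency checks in a separate endpoint such as /health/detail)
  • Performs only lightweight checks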
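Whether an endpoint meets the first two requirements can be verified with a small probe. The sketch below uses only the standard library; `probe` is a hypothetical helper, the 100 ms budget comes from the list above, and the one-second timeout is an assumed value. It bypasses any proxy configured in the environment so that the backend is contacted directly.

```python
import time
import urllib.error
import urllib.request

# An opener without proxy handling, so the probe hits the backend directly.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url: str, max_latency_s: float = 0.1, timeout_s: float = 1.0) -> dict:
    """Request url once and report status code, latency, and a verdict."""
    start = time.monotonic()
    try:
        with _opener.open(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code   # 4xx and 5xx responses raise HTTPError
    except OSError:
        status = None       # connection refused, timeout, or DNS failure
    latency = time.monotonic() - start
    ok = status == 200 and latency <= max_latency_s
    return {"status": status, "latency_s": latency, "ok": ok}
```

Calling `probe("http://192.168.1.10:8080/health")` returns a dict whose `ok` field is true only for a 200 response within the latency budget. A single measurement is noisy, so run it several times before drawing conclusions.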

9.4 Check Request

nginx
check_http_send "GET /health HTTP/1.0\r\nHost: backend\r\nConnection: close\r\n\r\n";

10. Complete Configuration Example

nginx
upstream backend {
    zone backend 64k;
    
    server 192.168.1.10:8080 weight=3;
    server 192.168.1.11:8080 weight=3;
    server 192.168.1.12:8080 weight=2 backup;
    
    check interval=3000 rise=2 fall=3 timeout=1000 type=http;
    check_http_send "GET /health HTTP/1.0\r\nHost: backend\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
    
    keepalive 32;
}

server {
    listen 80;
    server_name api.example.com;
    
    location / {
        proxy_pass http://backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        
        proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
        proxy_next_upstream_tries 3;
        proxy_next_upstream_timeout 10s;
    }
    
    location /nginx_status {
        stub_status on;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
    
    location /upstream_status {
        check_status json;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
}

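11. Summary

This chapter covered:

  1. Passive health checks: max_fails and fail_timeout
  2. Active health checks: health_check in NGINX Plus
  3. Third-party module: nginx_upstream_check_module
  4. OpenResty checks: lua-resty-healthcheck
  5. Custom endpoints: implementing health checks in the application
  6. Monitoring and alerting: Prometheus and alert scripts
  7. Failure recovery: automatic recovery and manual control
  8. Best practices: check intervals and thresholds

With health checks covered, the next chapter moves on to troubleshooting.

Last updated: 2026-03-27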