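Nginx Health Checks
1. Health Check Overview
1.1 Why Health Checks Are Needed
- Detect failed servers promptly
- Remove unavailable nodes from rotation automatically
- Improve service availability
- Shorten the failure window that users notice
1.2 Types of Health Checks
| Type | Description | Nginx edition |
|---|---|---|
| Passive | Infers server health from the outcome of real client requests | Open source |
| Active | Sends periodic probe requests to each server | NGINX Plus (commercial) or third-party modules |
2. Passive Health Checks
2.1 Basic Configuration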
nginx
upstream backend {
    server 192.168.1.10:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.11:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.12:8080 max_fails=3 fail_timeout=30s;
}
server {
    location / {
        proxy_pass http://backend;
    }
}
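2.2 Parameters
| Parameter | Description | Default |
|---|---|---|
| max_fails | Number of failed attempts within fail_timeout after which the server is marked unavailable; 0 disables the accounting | 1 |
| fail_timeout | Both the window in which failures are counted and how long the server then stays unavailable | 10s |
2.3 How It Works
- Each failed attempt to communicate with a server increments its failure count
- When the count reaches max_fails within fail_timeout, the server is marked unavailable
- After fail_timeout has elapsed, Nginx sends live requests to the server again to probe it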
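The three steps above can be modelled in a few lines of Python. This is only a simplified sketch of the bookkeeping, not Nginx's actual implementation; the class name `PassiveServer` and its methods are hypothetical, and timestamps are passed in explicitly so the behaviour is easy to trace.

```python
from dataclasses import dataclass, field


@dataclass
class PassiveServer:
    """Simplified model of max_fails / fail_timeout for one upstream server."""
    name: str
    max_fails: int = 1
    fail_timeout: float = 10.0
    fail_times: list = field(default_factory=list)  # timestamps of recent failures
    down_until: float = 0.0

    def available(self, now: float) -> bool:
        # The server is skipped until fail_timeout has elapsed since it was marked down.
        return now >= self.down_until

    def report(self, ok: bool, now: float) -> None:
        if ok:
            self.fail_times.clear()
            return
        # Keep only the failures that still fall inside the fail_timeout window.
        self.fail_times = [t for t in self.fail_times if now - t < self.fail_timeout]
        self.fail_times.append(now)
        # max_fails=0 disables the accounting entirely.
        if self.max_fails and len(self.fail_times) >= self.max_fails:
            self.down_until = now + self.fail_timeout
            self.fail_times.clear()
```

With `max_fails=3` and `fail_timeout=30`, three failures reported at seconds 0, 1 and 2 make `available()` return False until second 32, after which live traffic is tried again.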
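2.4 What Counts as a Failure
What counts as a failed attempt is governed by proxy_next_upstream, whose default value is error timeout:
- error: an error while connecting to the server, sending the request, or reading the response header
- timeout: a timeout during any of those steps
- invalid_header: an empty or invalid response; always counted as a failure, even when not listed
- http_500, http_502, http_503, http_504 and http_429: counted as failures only when listed explicitly in proxy_next_upstream
- http_403 and http_404: never counted as failures
2.5 Custom Failure Conditions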
nginx
location / {
    proxy_pass http://backend;
    proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
    proxy_next_upstream_tries 3;
    proxy_next_upstream_timeout 10s;
}
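To illustrate how the try limit and the time limit in the block above interact, here is a rough Python sketch. It is a simplification rather than Nginx's algorithm: the function name `pass_with_retries` and the injectable `clock` are hypothetical, and only positive limits are modelled.

```python
import time
from typing import Callable, Optional, Sequence


def pass_with_retries(
    servers: Sequence[str],
    attempt: Callable[[str], bool],
    tries: int = 3,
    timeout: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Return the first server for which attempt() succeeds, or None."""
    start = clock()
    for count, server in enumerate(servers, start=1):
        if attempt(server):
            return server
        # Stop once the attempt budget or the time budget is used up.
        if count >= tries or clock() - start >= timeout:
            break
    return None
```

With `tries=3`, a request that fails on the first two servers is passed to a third one at most; if that also fails, the client receives the error from the last attempt.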
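2.6 proxy_next_upstream Parameters
| Parameter | Description |
|---|---|
| error | Error while connecting to the server, sending the request, or reading the response header |
| timeout | Timeout while connecting, sending the request, or reading the response header |
| invalid_header | The server returned an empty or invalid response |
| http_500 | The server returned a 500 response |
| http_502 | The server returned a 502 response |
| http_503 | The server returned a 503 response |
| http_504 | The server returned a 504 response |
| non_idempotent | Also retry non-idempotent requests (POST, LOCK, PATCH), which are otherwise not passed to the next server once they have been sent |
3. Active Health Checks (NGINX Plus)
3.1 Basic Configuration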
nginx
upstream backend {
    zone backend 64k;
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080;
}
server {
    location / {
        proxy_pass http://backend;
        health_check interval=5s fails=3 passes=2;
    }
}
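3.2 Parameters
| Parameter | Description | Default |
|---|---|---|
| interval | Time between two consecutive checks | 5s |
| fails | Consecutive failed checks before the server is considered unhealthy | 1 |
| passes | Consecutive passed checks before the server is considered healthy again | 1 |
| uri | URI requested by the check | / |
| port | Port to connect to for the check | The server's own port |
| match | Name of the match block that a response must satisfy | Any 2xx or 3xx status |
| type | Set to grpc to check gRPC upstreams | Plain HTTP check |
The HTTP health_check directive has no timeout parameter; probe timeouts follow the proxy timeouts configured for the location (proxy_connect_timeout, proxy_read_timeout).
3.3 Custom Check Path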
nginx
health_check uri=/health interval=5s fails=3 passes=2;
3.4 Matching the Response
nginx
match server_ok {
    status 200-399;
    header Content-Type = text/html;
    # body conditions take a regex operator: ~ (matches) or !~ (does not match)
    body ~ "OK";
}
server {
    location / {
        proxy_pass http://backend;
        health_check match=server_ok;
    }
}
3.5 The match Directive
nginx
match server_ok {
    status 200;
    status ! 500-599;
    header Content-Type ~ text;
    body !~ "error";
}
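All conditions in a match block must hold for a probe to pass. The Python sketch below spells out that logic for the block above; it is only an illustration of the semantics, the function name `server_ok` simply mirrors the block's name, and header lookup is made case-insensitive as an assumption about how the comparison behaves.

```python
import re
from typing import Mapping


def server_ok(status: int, headers: Mapping[str, str], body: str) -> bool:
    """Return True only when every condition of the match block holds."""
    lowered = {k.lower(): v for k, v in headers.items()}
    content_type = lowered.get("content-type", "")
    return (
        status == 200                                      # status must be 200
        and not 500 <= status <= 599                       # status must not be 5xx
        and re.search("text", content_type) is not None    # Content-Type matches "text"
        and re.search("error", body) is None               # body must not match "error"
    )
```

A 200 response with `Content-Type: text/html` passes unless its body contains the string `error`.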
3.6 TCP Health Checks
nginx
stream {
    upstream mysql {
        zone mysql 64k;
        server 192.168.1.10:3306;
        server 192.168.1.11:3306;
    }
    server {
        listen 3306;
        proxy_pass mysql;
        health_check interval=5s;
    }
}
4. Third-Party Health Check Modules
4.1 nginx_upstream_check_module
Installation:
bash
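cd /usr/local/src
git clone https://github.com/yaoweibin/nginx_upstream_check_module.git
# Assumes the nginx-1.24.0 source tree is already unpacked in /usr/local/src
cd nginx-1.24.0
# Use the patch file from the module repository that matches your nginx version
patch -p1 < /usr/local/src/nginx_upstream_check_module/check_1.24.0+.patch
./configure --add-module=/usr/local/src/nginx_upstream_check_module
make && make install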
4.2 Basic Configuration
nginx
upstream backend {
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080;
    check interval=3000 rise=2 fall=3 timeout=1000 type=http;
    check_http_send "GET /health HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}
server {
    location / {
        proxy_pass http://backend;
    }
    location /status {
        check_status;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
}
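4.3 Parameters
| Parameter | Description |
|---|---|
| interval | Interval between checks, in milliseconds |
| rise | Consecutive successful checks before a server is marked up |
| fall | Consecutive failed checks before a server is marked down |
| timeout | Check timeout, in milliseconds |
| type | Check type (tcp/http/ssl_hello/mysql/ajp) |
4.4 Check Types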
nginx
check interval=3000 rise=2 fall=3 timeout=1000 type=tcp;
check interval=3000 rise=2 fall=3 timeout=1000 type=http;
check interval=3000 rise=2 fall=3 timeout=1000 type=ssl_hello;
check interval=3000 rise=2 fall=3 timeout=1000 type=mysql;
check interval=3000 rise=2 fall=3 timeout=1000 type=ajp;
4.5 HTTP Check Configuration
nginx
check interval=3000 rise=2 fall=3 timeout=1000 type=http;
check_http_send "GET /health HTTP/1.0\r\nHost: backend\r\n\r\n";
check_http_expect_alive http_2xx http_3xx;
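To make the two directives above more concrete, the following Python sketch builds the same raw request and classifies a raw response by status class. It only illustrates the idea and is not the module's code; `build_probe` and `is_alive` are hypothetical helper names.

```python
def build_probe(path: str = "/health", host: str = "backend") -> bytes:
    """Raw HTTP/1.0 request, equivalent to the check_http_send string."""
    return f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")


def is_alive(response: bytes, expect=("http_2xx", "http_3xx")) -> bool:
    """Classify a raw response by the class of its status code."""
    status_line = response.split(b"\r\n", 1)[0].decode("ascii", "replace")
    parts = status_line.split(" ", 2)
    if len(parts) < 2 or not parts[0].startswith("HTTP/"):
        return False  # not an HTTP response at all
    code = parts[1]
    if len(code) != 3 or not code.isdigit():
        return False  # malformed status code
    return f"http_{code[0]}xx" in expect
```

Under this sketch a `302 Found` reply counts as alive, while a `503 Service Unavailable` reply or a non-HTTP banner counts as a failed check.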
4.6 Status Page
nginx
location /nginx_status {
    check_status;
    access_log off;
    allow 127.0.0.1;
    deny all;
}
Requesting /nginx_status shows the health status of each upstream server.
5. OpenResty Health Checks
5.1 lua-resty-healthcheck
Installation:
bash
# lua-resty-healthcheck is published on LuaRocks
luarocks install lua-resty-healthcheck
5.2 Configuration Example
nginx
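lua_shared_dict healthcheck 1m;
init_worker_by_lua_block {
    local healthcheck = require "resty.healthcheck"
    -- Depending on the library version, a worker-events library must be
    -- configured before this call; see the project README.
    local checker = healthcheck.new({
        name = "backend",
        shm_name = "healthcheck",  -- name of the lua_shared_dict above
        checks = {
            active = {
                http_path = "/health",
                healthy = {
                    interval = 5,
                    successes = 2
                },
                unhealthy = {
                    interval = 5,
                    http_failures = 3
                }
            },
            passive = {
                healthy = {
                    http_statuses = { 200, 201, 202 },
                    successes = 2
                },
                unhealthy = {
                    http_statuses = { 500, 502, 503, 504 },
                    http_failures = 3
                }
            }
        }
    })
    checker:add_target("192.168.1.10", 8080)
    checker:add_target("192.168.1.11", 8080)
    package.loaded.checker = checker
}
server {
    location / {
        # $upstream_addr is a built-in variable, so use a custom one
        set $backend_addr "";
        access_by_lua_block {
            local checker = package.loaded.checker
            -- The library exposes no select_target(); query each known
            -- target with get_target_status() and take the first healthy one.
            local targets = {
                { "192.168.1.10", 8080 },
                { "192.168.1.11", 8080 },
            }
            for _, t in ipairs(targets) do
                if checker:get_target_status(t[1], t[2]) then
                    ngx.var.backend_addr = t[1] .. ":" .. t[2]
                    return
                end
            end
            return ngx.exit(503)
        }
        proxy_pass http://$backend_addr;
    }
}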
6. Custom Health Check Endpoints
6.1 Application Endpoint
The backend application should expose a health check endpoint:
python
import time
from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/health')
def health():
    return jsonify({
        'status': 'healthy',
        'timestamp': time.time()
    }), 200
6.2 Detailed Health Check
python
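# check_database, check_redis and check_disk are application-defined
# helpers that return True when the dependency is healthy.
@app.route('/health/detail')
def health_detail():
    checks = {
        'database': check_database(),
        'redis': check_redis(),
        'disk': check_disk()
    }
    all_healthy = all(checks.values())
    # 503 tells the load balancer to take this instance out of rotation
    status = 200 if all_healthy else 503
    return jsonify({
        'status': 'healthy' if all_healthy else 'unhealthy',
        'checks': checks
    }), status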
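The helper functions used by the detailed endpoint are not shown in the original. As one possible shape for such a helper, here is a minimal `check_disk` sketch built on the standard library; the 10% free-space threshold and the default path are assumptions chosen for illustration, not values from the source.

```python
import shutil


def check_disk(path: str = "/", min_free_ratio: float = 0.10) -> bool:
    """Return True while at least min_free_ratio of the filesystem is free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return False  # cannot judge an empty or pseudo filesystem
    return usage.free / usage.total >= min_free_ratio
```

`check_database` and `check_redis` would follow the same pattern: perform one cheap operation against the dependency and return a boolean rather than raising.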
6.3 Nginx Health Check Configuration
nginx
upstream backend {
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    check interval=5000 rise=2 fall=3 timeout=1000 type=http;
    check_http_send "GET /health HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx;
}
7. Monitoring and Alerting
7.1 Prometheus Monitoring
nginx
location /nginx_status {
    stub_status on;
    access_log off;
    allow 127.0.0.1;
    deny all;
}
location /upstream_status {
    # check_status takes the output format as a plain argument: html, csv or json
    check_status json;
    access_log off;
    allow 127.0.0.1;
    deny all;
}
7.2 Alert Script
bash
#!/bin/bash
STATUS=$(curl -s http://localhost/upstream_status)
# The module's JSON output nests the server list under .servers.server
UNHEALTHY=$(echo "$STATUS" | jq '[.servers.server[] | select(.status != "up")] | length')
if [ "$UNHEALTHY" -gt 0 ]; then
    echo "Alert: $UNHEALTHY servers are unhealthy"
    # Send the alert notification here
fi
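The same filtering step can be written in Python when the alert logic needs to grow beyond a shell one-liner. This sketch assumes the status page returns JSON with the server list under `servers.server` and a `status` field per entry; the function name `unhealthy_servers` is hypothetical.

```python
import json


def unhealthy_servers(status_json: str) -> list:
    """Return the names of upstream servers whose status is not "up"."""
    data = json.loads(status_json)
    servers = data.get("servers", {}).get("server", [])
    return [s.get("name", "?") for s in servers if s.get("status") != "up"]
```

Returning the names instead of a count lets the alert message say which backends are down.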
7.3 Scheduled Checks
bash
* * * * * /usr/local/bin/check_nginx_health.sh
8. Failure Recovery
8.1 Automatic Recovery
The health check module puts recovered servers back into rotation automatically:
nginx
check interval=3000 rise=2 fall=3;
After 2 consecutive successful checks, the server is marked up again.
8.2 Manual Control
Mark a server as offline:
nginx
upstream backend {
    server 192.168.1.10:8080 down;
    server 192.168.1.11:8080;
}
Use a backup server:
nginx
upstream backend {
    server 192.168.1.10:8080;
    server 192.168.1.11:8080 backup;
}
8.3 Graceful Removal
nginx
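location /admin/downstream {
    allow 127.0.0.1;
    # Without deny all, the allow rule above restricts nothing
    deny all;
    content_by_lua_block {
        local checker = package.loaded.checker
        -- Recent library versions take (ip, port, hostname, is_healthy)
        checker:set_target_status("192.168.1.10", 8080, nil, false)
        ngx.say("Server marked as down")
    }
}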
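9. Best Practices
9.1 Check Interval
| Scenario | Suggested interval |
|---|---|
| High-availability services | 2-5 seconds |
| Ordinary services | 5-10 seconds |
| Low-priority services | 10-30 seconds |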
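The interval also determines how long a dead server keeps receiving traffic. As a rough estimate, assuming probes fire at a fixed interval and a server is marked down after `fall` consecutive failures, the worst case can be computed as below; the function name is hypothetical and the formula is a back-of-the-envelope bound, not a guarantee.

```python
def worst_case_detection_ms(interval_ms: int, fall: int, timeout_ms: int) -> int:
    """Rough upper bound on the time from a server failing to it being marked down."""
    # A failure just after a successful probe waits almost one full interval for
    # the first failing probe, then (fall - 1) more intervals, and the last
    # probe may need the full timeout before it is counted as failed.
    return interval_ms * fall + timeout_ms


print(worst_case_detection_ms(3000, 3, 1000))  # 10000
```

With a 3-second interval, three required failures and a 1-second timeout, detection can take up to about 10 seconds, which is why latency-sensitive services sit at the short end of the table.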
9.2 Failure and Success Thresholds
nginx
check interval=3000 rise=2 fall=3;
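- rise=2: a server is marked up again after 2 consecutive successful checks
- fall=3: a server is marked down after 3 consecutive failed checks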
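The two thresholds give the state a hysteresis, so that a single stray result does not flip a server up or down. A minimal Python sketch of that state machine follows; the class name `RiseFall` is hypothetical and the code is an illustration of the idea, not the module's implementation.

```python
class RiseFall:
    """Up/down state with hysteresis driven by rise and fall thresholds."""

    def __init__(self, rise: int = 2, fall: int = 3, up: bool = True):
        self.rise, self.fall, self.up = rise, fall, up
        self.streak = 0  # consecutive results that contradict the current state

    def observe(self, ok: bool) -> bool:
        """Feed one probe result and return the resulting state (True = up)."""
        if ok == self.up:
            self.streak = 0  # result agrees with the state: reset the streak
        else:
            self.streak += 1
            needed = self.fall if self.up else self.rise
            if self.streak >= needed:
                self.up = not self.up
                self.streak = 0
        return self.up
```

Feeding it two failures followed by a success leaves the server up, because the success resets the streak before `fall=3` is reached.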
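9.3 Requirements for the Health Check Endpoint
- Responds in under 100 ms
- Returns status code 200
- Does not depend on external services
- Performs only lightweight checks
9.4 Check Request Content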
nginx
check_http_send "GET /health HTTP/1.0\r\nHost: backend\r\nConnection: close\r\n\r\n";
10. Complete Configuration Example
nginx
upstream backend {
    zone backend 64k;
    server 192.168.1.10:8080 weight=3;
    server 192.168.1.11:8080 weight=3;
    server 192.168.1.12:8080 weight=2 backup;
    check interval=3000 rise=2 fall=3 timeout=1000 type=http;
    check_http_send "GET /health HTTP/1.0\r\nHost: backend\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
    keepalive 32;
}
server {
    listen 80;
    server_name api.example.com;
    location / {
        proxy_pass http://backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
        proxy_next_upstream_tries 3;
        proxy_next_upstream_timeout 10s;
    }
    location /nginx_status {
        stub_status on;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
    location /upstream_status {
        # check_status takes the output format as a plain argument: html, csv or json
        check_status json;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
}
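11. Summary
In this chapter we covered:
- Passive health checks: max_fails and fail_timeout
- Active health checks: health_check in NGINX Plus
- Third-party modules: nginx_upstream_check_module
- OpenResty checks: lua-resty-healthcheck
- Custom endpoints: implementing health checks in the application
- Monitoring and alerting: Prometheus and an alert script
- Failure recovery: automatic recovery and manual control
- Best practices: check intervals and threshold settings
With health checks covered, the next chapter moves on to troubleshooting.
Last updated: 2026-03-27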