Troubleshooting #

1. Troubleshooting Overview #

1.1 Troubleshooting Workflow #

text

┌─────────────────────────────────────────────────────────────────────┐
│                      Troubleshooting Workflow                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   Detect issue ──► Collect info ──► Locate cause ──► Resolve issue  │
│        │                │                │                │         │
│        │                │                │                │         │
│        ▼                ▼                ▼                ▼         │
│   Monitoring       Log analysis     Diagnostic       Fix and        │
│   and alerts                        tools            verify         │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

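1.2 Common Diagnostic Tools #

Tool	Purpose
varnishlog	Detailed log analysis
varnishstat	Viewing statistics counters
varnishadm	Management commands
varnishtop	Real-time ranking of log entries
curl	Request testing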

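2. Common Problems #

2.1 Low Cache Hit Rate #

Symptoms:

Cache hit rate is lower than expected
A large share of requests pass through to the backend

Diagnostic steps: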

bash

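# 1. Check the cache statistics
varnishstat -1 -f MAIN.cache_hit -f MAIN.cache_miss

# 2. Inspect the requests that missed the cache
#    (a miss is logged as "VCL_call MISS"; VCL_return never carries "miss")
varnishlog -g request -q "VCL_call eq MISS"

# 3. Check for Cookie headers
varnishlog -i ReqHeader -q "ReqHeader:Cookie"

# 4. Check for Vary headers
varnishlog -i BerespHeader -q "BerespHeader:Vary"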

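Solution:

vcl

sub vcl_recv {
    # Strip cookies from static asset requests (with or without a query string)
    if (req.url ~ "\.(css|js|png|gif|jpg|jpeg|ico|svg)(\?.*)?$") {
        unset req.http.Cookie;
    }
}

sub vcl_backend_response {
    # Drop unnecessary Vary headers from static assets
    if (bereq.url ~ "\.(css|js|png|gif|jpg|jpeg|ico|svg)(\?.*)?$") {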
        unset beresp.http.Vary;
    }
}
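
To confirm that a change like the one above took effect, it helps to check whether a given response was actually served from cache. By default Varnish's X-Varnish response header carries two transaction IDs on a cache hit and one on a miss. The sketch below relies on that; the function name `classify_cache_result` is made up for this example, and it assumes the header has not been stripped or rewritten by VCL or by another proxy.

```bash
# Classify a response as HIT or MISS from its headers (read on stdin).
# Assumption: the default X-Varnish header is present and unmodified.
classify_cache_result() {
    awk '
        BEGIN { result = "UNKNOWN" }
        tolower($1) == "x-varnish:" {
            sub(/\r$/, "")    # curl prints CRLF line endings; drop the CR
            # $1 is the header name, so two IDs give three fields.
            result = (NF >= 3) ? "HIT" : "MISS"
        }
        END { print result }
    '
}
```

A possible invocation, with a hypothetical URL, is `curl -sI http://localhost:6081/static/app.css | classify_cache_result`. Running it twice in a row should typically show MISS and then HIT once the object is cacheable.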

2.2 Backend Connection Failures #

Symptoms:

503 errors
Backend unavailable

Diagnostic steps:

bash

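# 1. Check backend status
varnishadm backend.list

# 2. Show backend fetch errors (logged under the FetchError tag)
varnishlog -g request -q "FetchError"

# 3. Test the backend directly
curl -I http://backend-server:8080/healthcheck

# 4. Check network connectivity
telnet backend-server 8080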

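Solution:

vcl

# Raise the timeouts and add a health-check probe. Both go into a single
# definition, because VCL does not accept two backends with the same name.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
    .connect_timeout = 10s;
    .first_byte_timeout = 120s;
    .between_bytes_timeout = 30s;
    .probe = {
        .url = "/healthcheck";
        .timeout = 2s;
        .interval = 5s;
        # Healthy when at least 3 of the last 5 probes succeed
        .window = 5;
        .threshold = 3;
    }
}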
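
Once a probe is configured, the `varnishadm backend.list` output shows which backends are currently marked sick. The column layout of that listing differs between Varnish versions, so the following is only a rough sketch: it assumes the backend name is the first column and that the word "sick" appears somewhere on the line. The function name `list_sick_backends` is invented for illustration.

```bash
# Print the names of backends that a `varnishadm backend.list` listing
# (read on stdin) reports as sick.
# Assumption: first column is the backend name; the header is line 1.
list_sick_backends() {
    awk 'NR > 1 && tolower($0) ~ /(^|[ \t])sick([ \t]|$)/ { print $1 }'
}
```

Typical use would be `varnishadm backend.list | list_sick_backends`, for example from a cron job that sends a notification when the output is not empty.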

2.3 Insufficient Memory #

Symptoms:

Frequent LRU evictions
Cached objects are evicted before their TTL expires

Diagnostic steps:

bash

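# 1. Check storage usage
#    (storage counters live under SMA.* for malloc and SMF.* for file storage)
varnishstat -1 -f SMA.s0.g_bytes -f SMA.s0.g_space

# 2. Check LRU evictions
varnishstat -1 -f MAIN.n_lru_nuked

# 3. Check the object count
varnishstat -1 -f MAIN.n_object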
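
The raw byte counters are easier to judge as a percentage. The sketch below reads `varnishstat -1` output and divides used bytes (`g_bytes`) by used plus free bytes (`g_space`). It assumes the input contains the counters of exactly one storage backend; the function name `storage_used_pct` is hypothetical.

```bash
# Print the percentage of cache storage in use, given `varnishstat -1`
# output on stdin with the g_bytes (used) and g_space (free) counters.
storage_used_pct() {
    awk '
        $1 ~ /\.g_bytes$/ { used = $2 }
        $1 ~ /\.g_space$/ { free = $2 }
        END {
            if (used + free > 0)
                printf "%.1f\n", used * 100 / (used + free)
            else
                print "n/a"
        }
    '
}
```

For malloc storage this could be called as `varnishstat -1 -f 'SMA.s0.*' | storage_used_pct`. A value that stays close to 100 while `MAIN.n_lru_nuked` keeps growing suggests the cache is too small for the working set.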

Solution:

bash

# Increase the storage size (option for the varnishd command line)
varnishd -s malloc,4G

# Or use file-backed storage
varnishd -s file,/var/lib/varnish/storage.bin,10G

2.4 Slow Responses #

Symptoms:

High request latency
Frequent timeouts

Diagnostic steps:

bash

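# 1. Show slow requests (processing finished more than 1 s after the request started)
varnishlog -q "Timestamp:Process[2] > 1.0"

# 2. Show backend latency (backend transactions only)
varnishlog -b -i Timestamp -i BereqURL

# 3. Check the thread queue length
varnishstat -1 -f MAIN.thread_queue_len

# 4. Check the session counters (accepted, queued, dropped)
varnishstat -1 -f MAIN.sess_conn -f MAIN.sess_queued -f MAIN.sess_dropped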

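Solution:

bash

# Increase the number of worker threads (values apply per thread pool)
varnishadm param.set thread_pool_min 200
varnishadm param.set thread_pool_max 5000

# Tune the backend first-byte timeout (seconds)
varnishadm param.set first_byte_timeout 60

# Values set with param.set are lost on restart; pass them with -p on the
# varnishd command line to make them permanent.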

2.5 VCL Syntax Errors #

Symptoms:

VCL fails to load
Configuration changes do not take effect

Diagnostic steps:

bash

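# 1. Check the VCL syntax (compiles the file and prints the generated C code)
varnishd -C -f /etc/varnish/default.vcl

# 2. View the error messages (starts varnishd in debug mode)
varnishd -d -f /etc/varnish/default.vcl

# 3. List the loaded VCLs
varnishadm vcl.list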

Solution:

bash

# After fixing the syntax error, load the file under a new name and activate it
varnishadm vcl.load new_config /etc/varnish/default.vcl
varnishadm vcl.use new_config
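
The two commands above can be wrapped so that the new configuration is activated only when it compiles. This is a minimal sketch, not an official tool: `reload_vcl` is a made-up function name, and the `VARNISHADM` variable is a hypothetical hook that lets the command be swapped out (for instance in tests); it defaults to the real `varnishadm`.

```bash
# Load a VCL file under a fresh name and activate it only if it compiles.
# VARNISHADM is a hypothetical override; by default varnishadm is used.
reload_vcl() {
    vcl_file=$1
    # vcl.load rejects a name that is already in use, so derive a unique
    # name from the current time.
    vcl_name="reload_$(date +%Y%m%d_%H%M%S)"
    adm=${VARNISHADM:-varnishadm}

    if "$adm" vcl.load "$vcl_name" "$vcl_file"; then
        "$adm" vcl.use "$vcl_name"
    else
        echo "VCL compilation failed; the active configuration is unchanged" >&2
        return 1
    fi
}
```

Called as `reload_vcl /etc/varnish/default.vcl`, a failed compile leaves the previously active VCL serving traffic, which is the main reason to load before switching.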

3. Debugging Techniques #

3.1 Enabling Debug Mode #

bash

# Start in the foreground for debugging
varnishd -d -f /etc/varnish/default.vcl

# Page through the detailed output
varnishd -d -f /etc/varnish/default.vcl 2>&1 | less

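3.2 Request Tracing #

bash

# Trace specific requests (the regular expression must be quoted inside the query)
varnishlog -g request -q 'ReqURL ~ "^/api/test"'

# Show the VCL execution flow
varnishlog -i VCL_call -i VCL_return

# Show timestamps
varnishlog -i Timestamp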
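
The VCL_call records can also be aggregated to see how lookups are being resolved overall. The following sketch counts the HIT, MISS, PASS and PIPE calls in `varnishlog` output; it assumes the default output layout, where the tag is the second whitespace-separated field. The function name `count_lookup_outcomes` is hypothetical.

```bash
# Count lookup outcomes from `varnishlog -c -i VCL_call` output on stdin.
# Assumption: default varnishlog layout, with the tag in field 2 and the
# called VCL subroutine in field 3.
count_lookup_outcomes() {
    awk '
        $2 == "VCL_call" && $3 ~ /^(HIT|MISS|PASS|PIPE)$/ { n[$3]++ }
        END { for (k in n) print k, n[k] }
    ' | sort
}
```

One way to run it against the log entries already in memory is `varnishlog -d -c -i VCL_call | count_lookup_outcomes`. A large PASS count often points at cookies or request methods that bypass the cache.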

3.3 Adding Debug Headers #

vcl

sub vcl_deliver {
    # Add debugging information
    set resp.http.X-Debug-URL = req.url;
    set resp.http.X-Debug-Method = req.method;
    set resp.http.X-Debug-Client = client.ip;
    set resp.http.X-Debug-Hits = obj.hits;
    set resp.http.X-Debug-TTL = obj.ttl;
    set resp.http.X-Debug-Grace = obj.grace;
}

3.4 Logging #

vcl

import std;

sub vcl_recv {
    # Log request details
    std.log("Request: " + req.method + " " + req.url);
}

sub vcl_backend_response {
    # Log the backend response
    std.log("Backend: " + bereq.url + " Status: " + beresp.status);
}

sub vcl_deliver {
    # Log delivery details
    std.log("Deliver: " + req.url + " Hits: " + obj.hits);
}

4. Performance Diagnosis #

4.1 Performance Analysis Script #

bash

#!/bin/bash
# performance_diagnosis.sh

echo "=== Varnish Performance Diagnosis ==="
echo "Time: $(date)"
echo ""

# 1. Cache hit rate
echo "1. Cache Performance"
HITS=$(varnishstat -1 -f MAIN.cache_hit | awk '{print $2}')
MISSES=$(varnishstat -1 -f MAIN.cache_miss | awk '{print $2}')
TOTAL=$((HITS + MISSES))
if [ "$TOTAL" -gt 0 ]; then
    RATE=$(echo "scale=2; $HITS * 100 / $TOTAL" | bc)
    echo "   Hit Rate: ${RATE}%"
    echo "   Hits: $HITS, Misses: $MISSES"
fi
echo ""

# 2. Request statistics
echo "2. Request Statistics"
REQ=$(varnishstat -1 -f MAIN.client_req | awk '{print $2}')
echo "   Total Requests: $REQ"
echo ""

# 3. Connection statistics
echo "3. Connection Statistics"
varnishstat -1 -f MAIN.sess_conn -f MAIN.sess_queued -f MAIN.sess_dropped
echo ""

# 4. Backend statistics
echo "4. Backend Statistics"
varnishstat -1 -f MAIN.backend_conn -f MAIN.backend_fail -f MAIN.backend_unhealthy
echo ""

# 5. Thread statistics
echo "5. Thread Statistics"
varnishstat -1 -f MAIN.threads -f MAIN.threads_limited -f MAIN.thread_queue_len
echo ""

# 6. Storage statistics (SMA.* for malloc storage, SMF.* for file storage)
echo "6. Memory Statistics"
varnishstat -1 -f SMA.s0.g_bytes -f SMA.s0.g_space
echo ""

# 7. Object statistics
echo "7. Object Statistics"
varnishstat -1 -f MAIN.n_object -f MAIN.n_expired -f MAIN.n_lru_nuked
echo ""

# 8. Backend status
echo "8. Backend Status"
varnishadm backend.list
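
The `cache_hit` and `cache_miss` counters are cumulative since varnishd started, so the hit rate printed by the script is a lifetime average. To see the current hit rate, two samples taken some time apart are needed. The sketch below does only the arithmetic; the function name `interval_hit_rate` and its argument order are assumptions made for this example.

```bash
# Hit rate (percent) over an interval, from two counter samples.
# Arguments: hits_before misses_before hits_after misses_after
interval_hit_rate() {
    awk -v h1="$1" -v m1="$2" -v h2="$3" -v m2="$4" 'BEGIN {
        hits = h2 - h1
        total = hits + (m2 - m1)
        if (total > 0)
            printf "%.2f\n", hits * 100 / total
        else
            print "n/a"    # no lookups happened during the interval
    }'
}
```

For example, `interval_hit_rate 1000 500 1900 600` covers an interval with 900 hits and 100 misses. In practice the four values would come from reading `MAIN.cache_hit` and `MAIN.cache_miss` with `varnishstat -1`, sleeping for a minute, and reading them again.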

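4.2 Bottleneck Analysis #

bash

# Analyze slow requests
varnishlog -d -q "Timestamp:Process[2] > 1.0" | \
    grep -E "(ReqURL|Timestamp)" | \
    head -50

# Analyze backend latency: list the 20 slowest backend fetches.
# Field 5 of a "Timestamp Beresp:" line is the time from the start of
# the fetch until the response headers arrived.
varnishlog -d -b -i Timestamp | \
    grep "Beresp:" | \
    sort -k5 -n | \
    tail -20

# Analyze failed requests
varnishlog -d -q "RespStatus >= 500" | \
    grep -E "(ReqURL|RespStatus|RespReason)" | \
    head -50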
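
The grep output above lists URLs and timestamps on separate lines, which makes it hard to see which URL was slow. A small awk filter can pair them up. This is a sketch under the assumption that the input is default-format `varnishlog` output for client requests, where a ReqURL record precedes the Timestamp records of the same request. `slow_requests` is a hypothetical name, and its optional argument is the threshold in seconds.

```bash
# Print "<seconds> <url>" for requests whose Process timestamp exceeds a
# threshold (default 1.0 s), slowest first. Reads varnishlog output on stdin.
slow_requests() {
    awk -v limit="${1:-1.0}" '
        $2 == "ReqURL" { url = $3 }    # remember the URL of this request
        # Field 5 is the time since the request started.
        $2 == "Timestamp" && $3 == "Process:" && $5 + 0 > limit + 0 {
            printf "%.3f %s\n", $5, url
        }
    ' | sort -rn
}
```

A possible invocation is `varnishlog -d -c -i ReqURL -i Timestamp | slow_requests 0.5`. If VCL rewrites the URL, the last ReqURL record of the request is the one reported.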

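5. Recovery #

5.1 Restarting the Service #

bash

# Graceful reload (reloads the configuration without restarting the process)
sudo systemctl reload varnish

# Full restart (the cache is emptied)
sudo systemctl restart varnish

# Emergency restart: stop, then start
sudo systemctl stop varnish
sudo systemctl start varnish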

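5.2 Configuration Rollback #

bash

# List the loaded VCLs
varnishadm vcl.list

# Roll back to an earlier configuration (use a name shown by vcl.list)
varnishadm vcl.use previous_config

# Discard the faulty configuration
varnishadm vcl.discard broken_config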

5.3 Clearing the Cache #

bash

# Invalidate all cached objects
varnishadm ban "req.url ~ /"

# Invalidate specific objects
varnishadm ban "req.url ~ ^/images/"

# A restart also empties the whole cache
sudo systemctl restart varnish

5.4 Switching Backends #

bash

# Set backend health manually
varnishadm backend.set_health server1 sick
varnishadm backend.set_health server2 healthy

# Switching to a standby backend:
# configure a fallback in VCL

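6. Common Error Codes #

6.1 503 Service Unavailable #

Causes:

Backend server is unavailable
Backend timed out
Backend returned an error

Diagnosis:

bash

# Check backend status
varnishadm backend.list

# Show backend fetch errors (logged under the FetchError tag)
varnishlog -g request -q "FetchError"

# Show backend responses with a 5xx status
varnishlog -q "BerespStatus >= 500"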
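
When many 503s occur, grouping the fetch errors by message shows the dominant cause quickly. The sketch below counts identical FetchError messages in `varnishlog` output. It assumes the default output layout with the tag in the second field; `summarize_fetch_errors` is a name invented for this example.

```bash
# Count identical FetchError messages in varnishlog output (stdin),
# most frequent first.
summarize_fetch_errors() {
    awk '
        $2 == "FetchError" {
            msg = $0
            sub(/^.*FetchError[ \t]+/, "", msg)    # keep only the message text
            count[msg]++
        }
        END { for (m in count) printf "%d %s\n", count[m], m }
    ' | sort -rn
}
```

It could be fed with `varnishlog -d -b -i FetchError | summarize_fetch_errors`. Timeout messages would point toward the timeout settings from section 2.2, while connection failures would point toward the backend or the network.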

6.2 502 Bad Gateway #

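Causes:

Backend returned an invalid response
Backend protocol error

With the built-in VCL, Varnish itself answers 503 when a backend fetch fails, so a 502 normally comes from a server or proxy behind Varnish and is passed through.

Diagnosis:

bash

# Show backend responses with status 502
varnishlog -q "BerespStatus == 502"

# Then check the backend's own logs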

6.3 504 Gateway Timeout #

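Causes:

Backend response timed out
Backend processing takes too long

Timeouts enforced by Varnish itself surface as 503 with the built-in VCL; a 504 is usually generated by a proxy on the backend side.

Diagnosis:

bash

# Show backend fetches whose response headers took more than 60 s
varnishlog -q "Timestamp:Beresp[2] > 60"

# Raise the first-byte timeout (seconds)
varnishadm param.set first_byte_timeout 120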

7. Preventive Measures #

7.1 Monitoring and Alerting #

yaml

# Key alerting rules
- alert: VarnishDown
  expr: up{job="varnish"} == 0
  for: 1m

- alert: VarnishLowHitRate
  expr: |
    rate(varnish_main_cache_hit[5m]) /
    (rate(varnish_main_cache_hit[5m]) + rate(varnish_main_cache_miss[5m])) < 0.7
  for: 5m

- alert: VarnishBackendDown
  # backend_unhealthy is a cumulative counter, so alert on its rate of increase
  expr: rate(varnish_main_backend_unhealthy[5m]) > 0
  for: 1m

7.2 Health Checks #

vcl

# Add a health-check endpoint
sub vcl_recv {
    if (req.url == "/healthcheck") {
        return (synth(200, "OK"));
    }
}

7.3 Graceful Degradation #

vcl

sub vcl_backend_error {
    # Return a friendly error page
    set beresp.status = 503;
    set beresp.http.Content-Type = "text/html";
    synthetic({"<!DOCTYPE html>
<html>
<head><title>Service Temporarily Unavailable</title></head>
<body>
<h1>We'll be right back</h1>
<p>Our service is temporarily unavailable. Please try again later.</p>
</body>
</html>"});
    return (deliver);
}

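8. Troubleshooting Checklist #

8.1 Service Status #

[ ] Is the Varnish process running?
[ ] Is the port listening?
[ ] Is the management interface reachable?
[ ] Are the backend servers healthy?

8.2 Performance #

[ ] Is the cache hit rate normal?
[ ] Is memory usage reasonable?
[ ] Are there enough threads?
[ ] Is the connection count normal?

8.3 Logs #

[ ] Are there error log entries?
[ ] Are there slow requests?
[ ] Are there backend failures?
[ ] Are there abnormal requests?

8.4 Configuration #

[ ] Is the VCL syntax valid?
[ ] Is the backend configuration correct?
[ ] Are the timeout settings reasonable?
[ ] Are the caching rules correct?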
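
Parts of this checklist can be automated. The helper below is a minimal sketch: it runs one command per checklist item and prints the item in the same checkbox style, ticked when the command succeeds. The function name `check` is an assumption made for this example, and which command backs each item is left to the operator.

```bash
# Run one checklist item: print "[x] <description>" when the command
# succeeds and "[ ] <description>" when it fails.
# Usage: check "<description>" <command> [args...]
check() {
    desc=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "[x] $desc"
    else
        echo "[ ] $desc"
        return 1
    fi
}
```

Possible uses include `check "Varnish process is running" pgrep -x varnishd`, `check "Management interface is reachable" varnishadm ping`, and `check "VCL syntax is valid" varnishd -C -f /etc/varnish/default.vcl`.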

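9. Summary #

This chapter covered:

Troubleshooting overview: workflow and tools
Common problems: low cache hit rate, backend connection failures, insufficient memory, slow responses, VCL syntax errors
Debugging techniques: debug mode, request tracing, debug headers, logging
Performance diagnosis: analysis script and bottleneck analysis
Recovery: restarts, rollbacks, clearing the cache, switching backends
Error codes: 503, 502, 504
Preventive measures: monitoring and alerting, health checks, graceful degradation

Congratulations on completing the Complete Guide to Varnish! You have now covered Varnish from the basics to advanced topics and are well on your way to becoming a Varnish expert.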