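Troubleshooting #

1. Troubleshooting Overview #

Troubleshooting is a core part of operating Kubernetes. It goes faster with a systematic method: narrow the scope, collect evidence, identify the cause, then apply a fix.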

1.1 Troubleshooting Approach #

text

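Troubleshooting approach
    │
    ├── 1. Determine the scope
    │   ├── Cluster level
    │   ├── Namespace level
    │   └── Pod level
    │
    ├── 2. Collect information
    │   ├── Check status
    │   ├── Check events
    │   └── Check logs
    │
    ├── 3. Analyze the cause
    │   ├── Configuration problems
    │   ├── Resource problems
    │   └── Network problems
    │
    └── 4. Fix the problem
        ├── Change the configuration
        ├── Adjust resources
        └── Repair the network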

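1.2 Common Commands #

Command	Purpose
kubectl get	List resources and their status
kubectl describe	Show detailed state and recent events for a resource
kubectl logs	Print container logs
kubectl exec	Run a command inside a container
kubectl events	List events (on older kubectl versions, use kubectl get events)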

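2. Pod Troubleshooting #

2.1 Pod Status #

Status	Meaning	Possible causes
Pending	Accepted by the cluster, but not all containers are running yet	Insufficient resources, node selector mismatch
Running	Bound to a node with at least one container running	Normal state
Succeeded	All containers exited successfully	Job completed
Failed	All containers terminated, at least one with an error	Container crashed; failed health check when the container is not restarted
Unknown	Pod state cannot be determined	Node communication problem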

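2.2 Troubleshooting Pending Pods #

bash

# Show Pod details; the Events section explains scheduling failures
kubectl describe pod <pod-name>

# List events for this Pod
kubectl get events --field-selector involvedObject.name=<pod-name>

# Common causes
# 1. Insufficient resources
# 2. Node selector mismatch
# 3. Taints without matching tolerations
# 4. PVC not bound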
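
The four causes above usually show up as distinct phrases in the FailedScheduling event message. The following is a minimal sketch, not a definitive parser: classify_pending is a hypothetical helper, and its match patterns are assumptions based on typical scheduler wording, which varies between Kubernetes versions. When a message lists several reasons, only the first matching pattern is reported.

```shell
# Classify a FailedScheduling message by the substrings it contains.
# classify_pending is a hypothetical helper; the patterns are assumptions
# about typical scheduler wording and may need adjusting for your version.
classify_pending() {
  case "$1" in
    *Insufficient*)          echo "insufficient resources" ;;
    *selector*)              echo "node selector or affinity mismatch" ;;
    *taint*)                 echo "untolerated taint" ;;
    *PersistentVolumeClaim*) echo "PVC not bound" ;;
    *)                       echo "unclassified: read the full event" ;;
  esac
}

# Illustrative message; copy the real one from kubectl describe pod.
classify_pending "0/3 nodes are available: 3 Insufficient cpu."
```

Paste the message from the Events section of kubectl describe pod as the single argument.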

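2.3 Troubleshooting CrashLoopBackOff #

bash

# Inspect Pod state, restart count, and events
kubectl describe pod <pod-name>

# Show logs from the previous (crashed) container instance
kubectl logs <pod-name> --previous

# Show the exit code of the last terminated container
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

# Common causes
# 1. Application fails to start
# 2. Health check (liveness probe) fails
# 3. Insufficient resources
# 4. Configuration error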
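
The exit code returned by the jsonpath query above narrows down the cause quickly. Here is a small sketch of how it can be read, assuming the usual convention that a code above 128 means the process was killed by signal (code - 128). The helper name explain_exit_code and its wording are illustrative, not part of kubectl.

```shell
# Map a container exit code to a likely meaning.
# explain_exit_code is a hypothetical helper. Codes above 128
# conventionally mean "killed by signal (code - 128)".
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exited normally"
  elif [ "$code" -eq 137 ]; then
    echo "SIGKILL (signal 9): possibly OOMKilled, check the termination reason"
  elif [ "$code" -eq 143 ]; then
    echo "SIGTERM (signal 15): the container was asked to stop"
  elif [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "application error: read the previous container logs"
  fi
}

explain_exit_code 137
```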

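2.4 Troubleshooting ImagePullBackOff #

bash

# Show Pod details; the Events section contains the pull error
kubectl describe pod <pod-name>

# Common causes
# 1. Image does not exist
# 2. Registry authentication failed
# 3. Network problem
# 4. Wrong image name or tag

# Fixes
# 1. Check the image name and tag
# 2. Configure imagePullSecrets
# 3. Check network connectivity from the node to the registry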

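3. Service Troubleshooting #

3.1 Service Unreachable #

bash

# Show the Service
kubectl get svc <service-name>

# Show its Endpoints
kubectl get endpoints <service-name>

# Check that Pods carry the labels the Service selects (assuming an app label)
kubectl get pods -l app=<app-name>

# Test the Service from a temporary Pod inside the cluster
kubectl run test --image=busybox --restart=Never --rm -it -- wget -qO- <service-name>:<port>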

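3.2 Empty Endpoints #

bash

# Show Pod labels
kubectl get pods --show-labels

# Show the Service selector (under spec.selector)
kubectl get svc <service-name> -o yaml

# Check Pod readiness (READY column)
kubectl get pods -l app=<app-name>

# Common causes
# 1. Labels do not match
# 2. Pods are not ready
# 3. Selector is misconfigured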
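
A Service selects a Pod only when every key=value pair in its selector is present in the Pod's labels. The sketch below makes that rule explicit; selector_matches is a hypothetical helper that takes both sides as comma-separated lists, the format printed by --show-labels.

```shell
# Return 0 when every key=value pair in the selector appears in the labels.
# selector_matches is a hypothetical helper; both arguments are
# comma-separated lists such as "app=web,tier=frontend".
selector_matches() {
  selector=$1
  labels=",$2,"
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;                        # this pair is present
      *) IFS=$old_ifs; return 1 ;;           # this pair is missing
    esac
  done
  IFS=$old_ifs
  return 0
}

if selector_matches "app=web" "app=web,tier=frontend"; then
  echo "selector matches"
else
  echo "selector does not match"
fi
```

Extra labels on the Pod are harmless; a single missing or misspelled pair is enough to leave the Endpoints empty.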

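3.3 DNS Resolution Problems #

bash

# Test DNS resolution from a temporary Pod
# (busybox:1.28 is pinned here because nslookup in some newer busybox images is unreliable)
kubectl run test --image=busybox:1.28 --restart=Never --rm -it -- nslookup <service-name>

# Check the CoreDNS Pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Show CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns

# Check the Pod's DNS configuration
kubectl exec -it <pod> -- cat /etc/resolv.conf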
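
When a short Service name fails to resolve, it helps to retry with the fully qualified name, which does not depend on the search domains in /etc/resolv.conf. This sketch builds that name; svc_fqdn is a hypothetical helper, and cluster.local is the default cluster domain, which a cluster may override.

```shell
# Build the fully qualified DNS name of a Service.
# svc_fqdn is a hypothetical helper. The namespace defaults to "default"
# and the cluster domain to "cluster.local" (an assumption; check yours).
svc_fqdn() {
  service=$1
  namespace=${2:-default}
  domain=${3:-cluster.local}
  echo "${service}.${namespace}.svc.${domain}"
}

svc_fqdn my-service prod
```

If the fully qualified name resolves but the short name does not, the search list in the Pod's resolv.conf is the likely culprit rather than CoreDNS itself.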

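4. Network Troubleshooting #

4.1 Pods Cannot Reach Each Other #

bash

# Test Pod-to-Pod connectivity (the image must include ping)
kubectl exec -it <source-pod> -- ping <target-pod-ip>

# Check the network plugin Pods (this label assumes Calico)
kubectl get pods -n kube-system -l k8s-app=calico-node

# Check network policies in all namespaces
kubectl get networkpolicy -A

# Check the routing table inside the Pod
kubectl exec -it <pod> -- ip route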

4.2 External Access Problems #

bash

# Check the Service type
kubectl get svc <service-name>

# Show the NodePort of the first port
kubectl get svc <service-name> -o jsonpath='{.spec.ports[0].nodePort}'

# Check Ingress
kubectl get ingress
kubectl describe ingress <ingress-name>

# Check firewall rules (run on the node, as root)
iptables -L -n

5. Storage Troubleshooting #

5.1 PVC Pending #

bash

# Show PVC state and events
kubectl describe pvc <pvc-name>

# Show PV state
kubectl get pv

# Check StorageClasses
kubectl get storageclass

# Common causes
# 1. No matching PV
# 2. StorageClass does not exist
# 3. Dynamic provisioning failed

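5.2 Mount Failures #

bash

# Show Pod events
kubectl describe pod <pod-name>

# Check PV state
kubectl describe pv <pv-name>

# Check the storage backend
# (the exact checks depend on the storage type)

# Common causes
# 1. PV does not exist
# 2. Storage backend failure
# 3. Permission problem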

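6. Node Troubleshooting #

6.1 Node NotReady #

bash

# Show node conditions and events
kubectl describe node <node-name>

# Check the kubelet (run on the node)
systemctl status kubelet

# Follow the kubelet logs
journalctl -u kubelet -f

# Check the container runtime (assuming containerd)
systemctl status containerd

# Common causes
# 1. kubelet stopped
# 2. Container runtime failure
# 3. Network problem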
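
On a cluster with many nodes, it can help to list only the nodes that need attention before describing each one. The following sketch filters the tabular output of kubectl get nodes; not_ready_nodes is a hypothetical helper, and the sample rows are illustrative rather than captured from a real cluster.

```shell
# Print the names of nodes whose STATUS column does not start with "Ready".
# not_ready_nodes is a hypothetical helper; it reads the table printed by
# "kubectl get nodes" on stdin and skips the header row.
not_ready_nodes() {
  awk 'NR > 1 && $2 !~ /^Ready/ { print $1 }'
}

# Illustrative input; in practice: kubectl get nodes | not_ready_nodes
not_ready_nodes <<'EOF'
NAME     STATUS                     ROLES           AGE   VERSION
node-a   Ready                      control-plane   30d   v1.29.0
node-b   NotReady                   <none>          30d   v1.29.0
node-c   Ready,SchedulingDisabled   <none>          30d   v1.29.0
EOF
```

Cordoned nodes (Ready,SchedulingDisabled) are deliberately treated as ready here, since cordoning is an administrative action rather than a fault.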

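6.2 Node Resource Shortage #

bash

# Show node capacity, allocatable resources, and allocated requests
kubectl describe node <node-name>

# Show current node usage (requires metrics-server)
kubectl top node <node-name>

# Show Pod usage across all namespaces, sorted by memory
kubectl top pods --all-namespaces --sort-by=memory

# Fixes
# 1. Clean up unused resources
# 2. Move Pods to other nodes
# 3. Add nodes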
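
To spot the nodes that are short on memory, the output of kubectl top node can be filtered by its MEMORY% column. This is a rough sketch: busy_nodes is a hypothetical helper, the threshold is whatever you consider too high, and the sample rows are illustrative.

```shell
# Print nodes whose MEMORY% (5th column of "kubectl top node") is at or
# above a threshold. busy_nodes is a hypothetical helper reading stdin.
busy_nodes() {
  threshold=$1
  awk -v limit="$threshold" 'NR > 1 {
    pct = $5
    sub(/%/, "", pct)                 # strip the percent sign
    if (pct + 0 >= limit + 0) print $1, $5
  }'
}

# Illustrative input; in practice: kubectl top node | busy_nodes 80
busy_nodes 80 <<'EOF'
NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-a   250m         12%    3100Mi          41%
node-b   1800m        90%    7200Mi          93%
EOF
```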

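7. Debugging Tools #

7.1 kubectl debug #

bash

# Attach an ephemeral debug container to a running Pod
kubectl debug <pod-name> -it --image=busybox

# Debug a copy of the Pod instead of the original
kubectl debug <pod-name> -it --copy-to=debug-pod --image=busybox

# Debug a node (the node's filesystem is mounted at /host in the debug Pod)
kubectl debug node/<node-name> -it --image=busybox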

7.2 Debugging with a Temporary Pod #

yaml

apiVersion: v1
kind: Pod
metadata:
  name: debug-pod
spec:
  containers:
  - name: debug
    image: nicolaka/netshoot
    command: ["sleep", "3600"]

bash

# Create the debug Pod
kubectl apply -f debug-pod.yaml

# Open a shell in it
kubectl exec -it debug-pod -- /bin/bash

# Network debugging tools to use inside the Pod:
# curl, wget, ping, nslookup, dig, netstat, ss, tcpdump

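7.3 Log Analysis #

bash

# Show Pod logs
kubectl logs <pod-name>

# Follow logs in real time
kubectl logs -f <pod-name>

# Show logs from all containers in the Pod
kubectl logs <pod-name> --all-containers

# Show logs from the previous container instance
kubectl logs <pod-name> --previous

# Save logs to a file
kubectl logs <pod-name> > app.log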

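8. Fixing Common Problems #

8.1 OOMKilled #

bash

# Show Pod state; look for Reason: OOMKilled under Last State
kubectl describe pod <pod-name>

# Check the exit code
# OOMKilled exit code: 137 (128 + 9, SIGKILL)

# Fixes
# 1. Raise the memory limit
# 2. Reduce the application's memory usage
# 3. Check for memory leaks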

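8.2 Health Check Failures #

bash

# Show Pod events
kubectl describe pod <pod-name>

# Inspect the probe configuration
kubectl get pod <pod-name> -o yaml

# Test the probe endpoint from inside the container (the image must include curl)
kubectl exec -it <pod> -- curl localhost:<port>/<path>

# Fixes
# 1. Adjust the probe parameters
# 2. Fix the application's health check endpoint
# 3. Increase the initial delay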
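
When tuning probe parameters, a useful first check is how long the container has before the probe gives up. As a rough approximation, that budget is initialDelaySeconds + periodSeconds × failureThreshold. The sketch below computes it; probe_budget_seconds is a hypothetical helper, and the estimate ignores timeoutSeconds and successThreshold.

```shell
# Estimate how many seconds a container has before a probe gives up.
# probe_budget_seconds is a hypothetical helper; arguments are
# initialDelaySeconds, periodSeconds, and failureThreshold.
probe_budget_seconds() {
  initial_delay=$1
  period=$2
  failure_threshold=$3
  echo $((initial_delay + period * failure_threshold))
}

# initialDelaySeconds=10, periodSeconds=10, failureThreshold=3
probe_budget_seconds 10 10 3
```

If the application routinely needs longer than this budget to start, raising the initial delay or the failure threshold is usually safer than shortening the period.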

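8.3 Expired Certificates #

bash

# Check certificate expiration (kubeadm clusters, run on a control plane node)
kubeadm certs check-expiration

# Renew all certificates
kubeadm certs renew all

# Restart the control plane so it loads the renewed certificates.
# kubeadm runs these components as static Pods, which kubectl cannot restart directly:
# move the manifest out of the manifests directory, wait for the kubelet to stop
# the Pod, then move it back. Repeat for the other control plane manifests.
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sleep 20
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/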

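9. Troubleshooting Flowchart #

text

Pod troubleshooting flow
    │
    ├── Pending
    │   ├── Check whether resources are sufficient
    │   ├── Check node selectors
    │   └── Check PVC status
    │
    ├── CrashLoopBackOff
    │   ├── Check container logs
    │   ├── Check health checks
    │   └── Check resource settings
    │
    ├── ImagePullBackOff
    │   ├── Check the image name
    │   ├── Check registry authentication
    │   └── Check network connectivity
    │
    └── Running but misbehaving
        ├── Check application logs
        ├── Check network connectivity
        └── Check storage mounts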
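
The flow above can be sketched as a small lookup that suggests a first command for each Pod status. next_step is a hypothetical helper, the choice of command per status is one reasonable reading of the flow rather than a fixed rule, and <pod-name> is a placeholder to fill in by hand.

```shell
# Suggest a first command for a Pod status, following the flow above.
# next_step is a hypothetical helper; it only prints a suggestion.
next_step() {
  case "$1" in
    Pending)
      echo "kubectl describe pod <pod-name>" ;;
    CrashLoopBackOff)
      echo "kubectl logs <pod-name> --previous" ;;
    ImagePullBackOff|ErrImagePull)
      echo "kubectl describe pod <pod-name>" ;;
    Running)
      echo "kubectl logs <pod-name>" ;;
    *)
      echo "kubectl get pod <pod-name> -o yaml" ;;
  esac
}

next_step CrashLoopBackOff
```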

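10. Summary #

10.1 Key Points #

Area	Key commands
Pod status	kubectl describe pod
Log analysis	kubectl logs
Network debugging	kubectl exec, kubectl debug
Storage problems	kubectl describe pvc, kubectl describe pv
Node problems	kubectl describe node

10.2 Next Steps #

With troubleshooting covered, the next topic is cluster upgrades: how to move a Kubernetes cluster to a new version.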