Debugging a Kubernetes cluster requires a solid understanding of its components and their interdependencies. This second part of the guide covers advanced debugging techniques for common cluster issues:
- Node Issues
A. Node Not Ready
Check Node Status:
kubectl get nodes
kubectl describe node <node-name>
Inspect Kubelet Logs:
SSH into the node and review logs for errors:
journalctl -u kubelet -l
Possible Causes:
Resource exhaustion (e.g., CPU, memory, disk).
Misconfigured networking (e.g., unable to reach the API server).
Issues with container runtime (Docker, containerd).
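On larger clusters it can help to filter the node listing down to the nodes that need attention. The following is a minimal sketch, assuming the default column layout of kubectl get nodes (NAME first, STATUS second); not_ready_nodes is a hypothetical helper name, not part of kubectl.

```shell
# Hypothetical helper: print the names of nodes whose STATUS is not exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# Assumes the default column order: NAME STATUS ROLES AGE VERSION.
# Cordoned nodes ("Ready,SchedulingDisabled") are printed as well.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Usage (needs cluster access):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Each name it prints can then be passed to kubectl describe node for the details.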
B. Node Disk Pressure or Memory Pressure
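Check Allocations:
kubectl describe node <node-name> | grep -A 10 "Allocated resources"
Clean Up Disk Space:
Remove stopped containers, dangling images, and unused networks (on Docker-based nodes; on containerd nodes, crictl rmi --prune removes unused images):
docker system prune
Reconfigure Resource Limits:
Adjust resource requests and limits for pods so that scheduling reflects what the workloads actually use.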
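Before pruning anything, it may be worth confirming which filesystem on the node is actually filling up. This is a rough sketch that filters df -P output. fs_over_threshold is a hypothetical helper, and the default of 85 percent is an arbitrary illustrative value, not the kubelet's eviction threshold.

```shell
# Hypothetical helper: print mount points whose usage is at or above a threshold.
# Reads `df -P` output on stdin; $1 is the threshold in percent (default 85, an arbitrary choice).
# Assumes POSIX df columns: Filesystem 1024-blocks Used Available Capacity Mounted-on.
fs_over_threshold() {
  awk -v limit="${1:-85}" 'NR > 1 {
    use = $5
    sub(/%/, "", use)            # strip the trailing percent sign
    if (use + 0 >= limit + 0) print $6, $5
  }'
}

# Usage (run on the node):
#   df -P | fs_over_threshold 85
```

Mount points containing spaces are not handled by this sketch.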
- Pod Issues
A. Pod Stuck in Pending
Inspect Events:
kubectl describe pod <pod-name>
Possible Causes:
Insufficient resources: Check node capacity and pod requests.
Scheduling constraints: Inspect nodeSelector, taints, and tolerations.
Networking issues: Ensure the CNI plugin is functioning correctly.
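For resource and scheduling-constraint causes, the scheduler normally records a FailedScheduling event that explains why no node fit. A small sketch for isolating those events from the describe output follows; scheduling_failures is a hypothetical helper name.

```shell
# Hypothetical helper: keep only scheduler failure events.
# Reads `kubectl describe pod <pod-name>` or `kubectl get events` output on stdin.
scheduling_failures() {
  grep 'FailedScheduling'
}

# Usage (needs cluster access; <pod-name> is a placeholder):
#   kubectl describe pod <pod-name> | scheduling_failures
#   kubectl get events --field-selector reason=FailedScheduling
```

The event message usually names the unmet condition, for example insufficient CPU or an untolerated taint.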
B. CrashLoopBackOff
View Logs:
kubectl logs <pod-name> --previous
Check Events:
kubectl describe pod <pod-name>
Debugging Steps:
Ensure the container’s entrypoint is correct.
Verify environment variables and mounted volumes.
Test locally using the same image.
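To find which pods are restarting repeatedly in the first place, the RESTARTS column of the pod listing can be filtered. The sketch below assumes the default column layout of kubectl get pods -A; high_restart_pods and its default threshold of 5 are illustrative choices.

```shell
# Hypothetical helper: list pods whose restart count is at or above a threshold.
# Reads `kubectl get pods -A --no-headers` output on stdin; $1 is the threshold (default 5).
# Assumes the default column order: NAMESPACE NAME READY STATUS RESTARTS AGE.
high_restart_pods() {
  awk -v limit="${1:-5}" '$5 + 0 >= limit + 0 { print $1 "/" $2, $4, $5 }'
}

# Usage (needs cluster access):
#   kubectl get pods -A --no-headers | high_restart_pods 5
```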
C. Container Image Pull Issues
Inspect Events:
kubectl describe pod <pod-name>
Common Errors:
Unauthorized: Verify image pull secrets.
Image not found: Confirm the image exists in the registry.
- Networking Issues
A. Pods Can’t Communicate
Ping Other Pods:
kubectl exec -it <pod-name> -- ping <pod-ip>
Check Network Policies:
kubectl get networkpolicy -n <namespace>
Debugging CNI Plugins:
Inspect CNI logs:
cat /var/log/containers/<cni-plugin-name>*.log
B. Service Not Accessible
Check Service Description:
kubectl describe svc <service-name>
Inspect Endpoints:
kubectl get endpoints <service-name>
Test Connectivity:
From within a pod:
curl http://<service-name>.<namespace>:<port>
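An empty endpoint list usually means the Service selector matches no ready pods, which is worth ruling out before testing connectivity. This sketch turns the endpoints check into a pass/fail test. has_endpoints is a hypothetical helper, and it assumes kubectl prints an empty list as <none>.

```shell
# Hypothetical helper: exit 0 only if every line on stdin lists at least one endpoint.
# Reads `kubectl get endpoints <service-name> --no-headers` output on stdin.
# Assumes default columns: NAME ENDPOINTS AGE, with an empty list shown as "<none>".
has_endpoints() {
  awk '$2 == "<none>" { empty = 1 } END { exit (empty || NR == 0) }'
}

# Usage (needs cluster access; <service-name> is a placeholder):
#   kubectl get endpoints <service-name> --no-headers | has_endpoints \
#     || echo "no ready pods behind this service"
```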
- API Server Issues
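Inspect Logs (this applies when the API server runs as a systemd unit; on kubeadm clusters it runs as a static pod, so use kubectl logs -n kube-system kube-apiserver-<node-name> instead):
journalctl -u kube-apiserver
Test API Server Availability:
kubectl get --raw '/readyz?verbose'
The older /healthz endpoint still responds but has been deprecated since Kubernetes v1.16 in favor of /livez and /readyz.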
Common Causes:
SSL/TLS issues: Check certificates and CA bundle.
Resource bottlenecks: Monitor CPU/memory usage.
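The API server also exposes a verbose readiness report at /readyz?verbose, with one line per internal check. A minimal sketch for showing only the failing checks follows. failed_checks is a hypothetical helper, and it assumes failing checks are prefixed with [-] and passing ones with [+].

```shell
# Hypothetical helper: print only the failing checks from a verbose health report.
# Reads the output of `kubectl get --raw '/readyz?verbose'` on stdin.
# Assumes passing checks start with "[+]" and failing ones with "[-]".
failed_checks() {
  grep -F '[-]'
}

# Usage (needs cluster access):
#   kubectl get --raw '/readyz?verbose' | failed_checks
```

If nothing is printed, every listed check passed or the API server could not be reached at all.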
- Persistent Volume Issues
A. PVC Pending
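Inspect Events:
kubectl describe pvc <pvc-name>
Common Causes:
No matching StorageClass, or no default StorageClass when the PVC does not name one.
Insufficient storage: no available PersistentVolume or backend capacity satisfies the requested size.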
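To see every claim that is still waiting, the PVC listing can be filtered by phase. This is a rough sketch, assuming the default columns of kubectl get pvc -A begin with NAMESPACE, NAME, and STATUS; unbound_pvcs is a hypothetical helper name.

```shell
# Hypothetical helper: list PersistentVolumeClaims that are not in the Bound phase.
# Reads `kubectl get pvc -A --no-headers` output on stdin.
# Assumes the default columns start with: NAMESPACE NAME STATUS.
unbound_pvcs() {
  awk '$3 != "Bound" { print $1 "/" $2, $3 }'
}

# Usage (needs cluster access):
#   kubectl get pvc -A --no-headers | unbound_pvcs
```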
B. PV Bound But Pod Can’t Mount
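Inspect Events (a pod that cannot mount its volume usually has no container logs yet, so look for FailedMount or FailedAttachVolume events):
kubectl describe pod <pod-name>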
Debugging Steps:
Verify volume permissions.
Test mounting the volume manually on a node.
- Cluster DNS Issues
Test DNS Resolution:
kubectl exec -it <pod-name> -- nslookup <service-name>
Inspect CoreDNS Logs:
kubectl logs -n kube-system <coredns-pod-name>
Common Fixes:
Restart CoreDNS pods if unresponsive.
Validate ConfigMap for CoreDNS (kubectl get cm -n kube-system coredns).
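When testing resolution, querying the fully qualified Service name removes any dependence on the pod's DNS search path. The sketch below builds that name. svc_fqdn is a hypothetical helper, and cluster.local is assumed as the cluster domain, which is the common default but can be configured differently.

```shell
# Hypothetical helper: build the fully qualified DNS name of a Service.
# $1 = service name, $2 = namespace (default "default"),
# $3 = cluster domain (default "cluster.local"; an assumption, clusters may use another).
svc_fqdn() {
  printf '%s.%s.svc.%s\n' "$1" "${2:-default}" "${3:-cluster.local}"
}

# Usage (needs cluster access; <pod-name> is a placeholder):
#   kubectl exec -it <pod-name> -- nslookup "$(svc_fqdn web prod)"
```

If the fully qualified name resolves but the short name does not, the search domains in the pod's /etc/resolv.conf are a likely place to look.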
- Troubleshooting Tools
A. kubectl Debugging Tools
Debug running pods:
kubectl exec -it <pod-name> -- /bin/sh
Debug pods with ephemeral containers (alpha behind a feature gate in Kubernetes v1.18; enabled by default since v1.23):
kubectl debug -it <pod-name> --image=busybox
B. Third-Party Tools
Lens: GUI for Kubernetes cluster monitoring.
K9s: Terminal-based cluster management.
kubectl-trace: System-level tracing for Kubernetes.
C. Log Aggregation
Use tools like Fluentd, ELK Stack, or Loki for centralized logging.
- Proactive Cluster Monitoring
Implement monitoring systems like Prometheus, Grafana, or Datadog.
Set up alerting for critical metrics (e.g., node health, pod restarts).
Example: Debugging Workflow for a Non-Responsive Service
Check Pod Status:
kubectl get pods -n <namespace>
Describe the Service:
kubectl describe svc <service-name> -n <namespace>
Inspect Logs:
kubectl logs <pod-name> -n <namespace>
Test Connectivity:
From within the cluster:
curl http://<service-name>.<namespace>:<port>
From outside:
curl http://<external-ip>:<port>
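The read-only checks in this workflow can be bundled into one function so they run in the same order every time. This is only a sketch: debug_service and the KUBECTL override variable are hypothetical names, and the endpoints lookup is borrowed from the Service Not Accessible section rather than being one of the workflow steps listed here.

```shell
# Hypothetical wrapper around the read-only checks of the workflow.
# $1 = namespace, $2 = service name.
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the commands without a cluster.
# Note: ns, svc, and k are ordinary (global) shell variables in POSIX sh.
debug_service() {
  ns=$1
  svc=$2
  k=${KUBECTL:-kubectl}
  "$k" get pods -n "$ns"              # 1. pod status
  "$k" describe svc "$svc" -n "$ns"   # 2. service definition and selector
  "$k" get endpoints "$svc" -n "$ns"  # 3. pods actually backing the service
}

# Usage (needs cluster access):
#   debug_service <namespace> <service-name>
```

Pod logs and the curl tests are left out because they depend on a specific pod name and port.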
This deeper dive equips you to troubleshoot and resolve complex Kubernetes issues effectively.