The Insufficient Capacity error in Kubernetes (usually surfaced as a `FailedScheduling` event such as `Insufficient cpu` or `Insufficient memory`) typically occurs when no node in the cluster has enough free resources (CPU, memory, storage, or specific hardware like GPUs) to schedule a pod. Let's break it down:
🔴 Causes of the Insufficient Capacity Error:
Node Resource Exhaustion – All available nodes lack sufficient CPU/memory.
Pod Resource Requests Exceed Available Resources – The pod requests more CPU/memory than any single node can provide.
Tainted or Unschedulable Nodes – Nodes might have taints preventing scheduling.
Affinity and Anti-Affinity Constraints – Pods might have rules restricting where they can run.
Resource Limits on Namespaces – Namespace-level quotas might be exhausted.
Pod Disruption Budget (PDB) Restrictions – Tight PDBs block voluntary evictions, which can keep nodes from being drained or rebalanced and leave less room for new pods.
GPU or Special Hardware Constraints – If specific hardware is needed but unavailable.
🛠 Troubleshooting the Insufficient Capacity Error:
Check Node Resource Availability:
```bash
kubectl describe node <node-name>
```
Look for the `Allocatable` vs. `Capacity` values.
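For a quick per-node summary instead of the full describe output, a jsonpath query works (a sketch; adjust the fields to your needs):
```bash
# List each node with its allocatable CPU and memory
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.cpu}{"\t"}{.status.allocatable.memory}{"\n"}{end}'
```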
Check Pod Scheduling Events:
```bash
kubectl describe pod <pod-name>
```
Look for messages like `0/5 nodes are available: ... Insufficient memory`.
List Available Nodes:
```bash
kubectl get nodes
```
Ensure enough nodes are in the `Ready` state.
Check Resource Requests & Limits:
```bash
kubectl get pod <pod-name> -o yaml
```
Check the `resources.requests` and `resources.limits` on each container, and lower the requests in the Deployment YAML if they exceed what any node can offer.
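As a shortcut, `kubectl set resources` can lower the requests of a running Deployment without editing the manifest by hand (the Deployment name and values below are only illustrative):
```bash
# Reduce CPU/memory requests on a hypothetical "my-app" Deployment
kubectl set resources deployment my-app --requests=cpu=250m,memory=256Mi
```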
Inspect Node Affinity & Tolerations:
```bash
kubectl describe pod <pod-name>
```
Verify that the pod's node affinity, node selector, and tolerations actually match nodes that exist in the cluster.
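To see which taints might be blocking placement, list them per node; the toleration fragment below is a minimal sketch that assumes a hypothetical `dedicated=gpu:NoSchedule` taint:
```bash
# List the taints defined on every node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'
```
```yaml
# Pod spec fragment tolerating the hypothetical taint
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
```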
Check Namespace Quotas:
```bash
kubectl get resourcequota --all-namespaces
```
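If a quota is the blocker, describing it shows current usage against the hard limits (the namespace is a placeholder):
```bash
# Compare used resources to the quota's hard limits in one namespace
kubectl describe resourcequota -n <namespace>
```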
Monitor Cluster Metrics (Optional):
```bash
kubectl top nodes
kubectl top pods
```
Identify resource-heavy workloads.
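To spot the heaviest consumers at a glance, recent kubectl versions can sort the metrics output (requires metrics-server):
```bash
# Pods across all namespaces, sorted by memory usage
kubectl top pods --all-namespaces --sort-by=memory
```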
✅ Preventive Approaches:
Autoscaling
Enable Horizontal Pod Autoscaling (HPA) for workloads:
```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:        # the workload to scale (example Deployment name)
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
```
Pair it with the Cluster Autoscaler so new nodes are added automatically when pods cannot be scheduled.
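On managed Kubernetes, the Cluster Autoscaler is usually enabled per node pool rather than installed by hand. As an illustration only (GKE shown; cluster and pool names are placeholders, and other providers have equivalent settings):
```bash
# Enable node autoscaling on an existing GKE node pool (illustrative)
gcloud container clusters update <cluster-name> \
  --node-pool=<pool-name> \
  --enable-autoscaling --min-nodes=1 --max-nodes=5
```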
Resource Requests & Limits:
Define optimal requests:
```yaml
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "1"
    memory: "1Gi"
```
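If many pods in a namespace ship without any requests at all, a LimitRange can apply sensible defaults automatically. A minimal sketch, with illustrative values:
```yaml
# LimitRange: default requests/limits for containers that do not declare their own
apiVersion: v1
kind: LimitRange
metadata:
  name: default-resources
spec:
  limits:
    - type: Container
      defaultRequest:   # applied when a container sets no requests
        cpu: "250m"
        memory: "256Mi"
      default:          # applied when a container sets no limits
        cpu: "500m"
        memory: "512Mi"
```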
Node Pool Expansion:
Increase the node count manually through your cloud provider's console or CLI.
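For example, with eksctl on EKS the node group can be grown from the command line (names and count are placeholders; GKE, AKS, etc. have equivalent commands):
```bash
# Manually scale an EKS managed node group to 5 nodes (illustrative)
eksctl scale nodegroup --cluster=<cluster-name> --name=<nodegroup-name> --nodes=5
```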
Optimize Workloads:
Scale down less critical workloads.
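A quick way to free capacity is to temporarily reduce the replicas of a non-critical Deployment (the name is a placeholder):
```bash
# Temporarily scale a non-critical workload down to one replica
kubectl scale deployment <deployment-name> --replicas=1
```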
Monitor with Prometheus & Grafana:
Set up alerts for high resource usage.
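As one concrete option, if the Prometheus Operator (e.g. kube-prometheus-stack) is installed, a rule like the sketch below alerts on cluster-wide memory pressure. The metric names assume node-exporter is running, and the threshold is illustrative:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: capacity-alerts
spec:
  groups:
    - name: capacity
      rules:
        - alert: ClusterMemoryAlmostFull
          # fraction of total node memory in use, cluster-wide
          expr: (1 - sum(node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes)) > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Cluster memory usage has been above 90% for 10 minutes"
```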