Horizontal Pod Autoscaling
Kubernetes offers powerful capabilities for managing resources and automatically scaling applications to meet demand. This detailed overview will cover Resource Quota for CPU and Memory, and Horizontal Pod Autoscaling (HPA), complete with practical considerations for exams and interviews.
Kubernetes Resource Quota: CPU and Memory Management#
Resource Quotas in Kubernetes allow you to define minimum and maximum resource allocations for containers within a pod, typically enforced through a LimitRange object in a namespace. This is crucial for efficient resource utilization and for preventing a single container from consuming excessive resources.
Default Resource Ranges for a Container: When a range is defined for a namespace, Kubernetes enforces minimum and maximum values for the CPU and memory that a container can request or be limited to. The examples in this guide assume the following ranges:
* CPU: A container's minimum CPU request cannot be less than 0.5 CPU, and its maximum CPU limit cannot exceed 1 CPU.
* Memory: A container's minimum memory request cannot be less than 500MB, and its maximum memory limit cannot exceed 1GB.
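For reference, a LimitRange enforcing exactly these bounds could look like the following. This is a minimal sketch: the object name is a placeholder, and it applies to whichever namespace you create it in.
# Hypothetical LimitRange enforcing this guide's CPU/memory ranges
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-mem-range   # placeholder name
spec:
  limits:
  - type: Container
    min:
      cpu: "500m"       # minimum CPU request: 0.5 CPU
      memory: 500Mi     # minimum memory request: ~500MB
    max:
      cpu: "1"          # maximum CPU limit: 1 CPU
      memory: 1Gi       # maximum memory limit: ~1GB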
Request and Limit Behavior: Understanding the interaction between requests and limits is vital.
* When limit is mentioned but request is not: If you specify a limit for CPU, for instance 500 milliCPU, but do not mention the request, then the request will automatically be set equal to the limit. This means the container will be allocated 500 milliCPU as its request.
* Example Scenario:
# Example part of a Pod YAML definition
resources:
  limits:
    cpu: "500m"   # Limit is 500 milliCPU
  # requests:     # Request is not mentioned; it defaults to the limit
* When request is mentioned but limit is not: If you specify a request for CPU, for example 600 milliCPU, but do not mention the limit, then the limit will not automatically equal the request. Instead, the limit will default to its maximum allowed value, which for CPU is 1 CPU.
* Example Scenario:
# Example part of a Pod YAML definition
resources:
  requests:
    cpu: "600m"   # Request is 600 milliCPU
  # limits:       # Limit is not mentioned; it defaults to the maximum (1 CPU)
* When request and limit are mentioned only for memory: If you define request and limit only for memory, for instance, and do not mention CPU resources, then CPU will automatically pick up its default values. This means the CPU request will be 0.5 CPU and the CPU limit will be 1 CPU. A minimal sketch of this case follows.
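# Example part of a Pod YAML definition (memory only; CPU is omitted)
resources:
  requests:
    memory: 600Mi   # Memory request specified explicitly
  limits:
    memory: 800Mi   # Memory limit specified explicitly
# CPU is not mentioned, so it picks up the namespace defaults
# (request 0.5 CPU, limit 1 CPU under this guide's assumed ranges)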
Memory Resource Quota Examples and Error Scenarios:
When defining memory resources, it's critical to stay within the default ranges.
* Setting a Limit Range (e.g., using a LimitRange object): You can define a LimitRange object to set the default minimum and maximum memory for containers within a namespace.
* Example YAML structure for LimitRange:
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-default-limit-range
spec:
  limits:
  - default:
      memory: 1Gi      # Default limit for memory
    defaultRequest:
      memory: 500Mi    # Default request for memory
    type: Container
* Valid Scenario: If you set a request of 600MB and a limit of 800MB for a container, this is valid because 600MB is above the 500MB minimum request and 800MB is below the 1GB maximum limit.
* Example YAML part (a minimal sketch of the relevant resources block):
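# Example part of a Pod YAML definition
resources:
  requests:
    memory: 600Mi   # 600MB request, above the 500MB minimum
  limits:
    memory: 800Mi   # 800MB limit, below the 1GB maximum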
When applied, this pod will be created successfully, and kubectl describe pod <pod-name> will show both the 600MB request and the 800MB limit.
* Error Scenario: Limit Exceeds Maximum: If you try to set a memory limit greater than 1GB (e.g., 1200MB or 1800MB), Kubernetes will throw an error. The error message will explicitly state that the maximum memory for your container is 1GB.
* Example YAML part leading to this error (a minimal sketch):
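# Example part of a Pod YAML definition
resources:
  limits:
    memory: 1200Mi   # Exceeds the 1GB maximum limit, so the pod is rejected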
The command kubectl apply -f mem-too-high-limit.yaml
would result in an error similar to: "The maximum memory for your container is 1GB and you have increased the limit to 1200MB, which is not possible".
* Error Scenario: Request Below Minimum: If you attempt to set a memory request less than 500MB (e.g., 300MB), it will also result in an error. The error message will indicate that the minimum memory usage per container is 500MB.
* Example YAML part leading to this error (a minimal sketch):
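# Example part of a Pod YAML definition
resources:
  requests:
    memory: 300Mi   # Below the 500MB minimum request, so the pod is rejected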
The command kubectl apply -f mem-too-low-request.yaml
would result in an error similar to: "The minimum memory usage per container is 500MB, but you have requested 300MB which is not possible".
These resource quota configurations ensure that containers operate within defined boundaries, preventing resource hogging and promoting cluster stability. Note that these default limits apply only to containers in the namespace where you explicitly define the LimitRange; they are not applied to the default namespace automatically.
Horizontal Pod Autoscaling (HPA)#
Horizontal Pod Autoscaling (HPA) is a Kubernetes feature that automatically scales the number of pods in a deployment or replica set based on observed CPU utilization or other custom metrics. It's a critical component for achieving automation in modern DevOps environments, as manual scaling is inefficient and impractical for fluctuating loads.
Concept and Need for HPA: Imagine an application like Hotstar during a major cricket match, experiencing over 1.2 crore (12 million) viewers. Without auto-scaling, handling such sudden and massive increases in user traffic would require manual intervention, leading to potential service disruptions. HPA addresses this by allowing your application to automatically create new pods (horizontal scaling) when the load increases, and reduce pods when the load decreases.
Core Components of HPA: HPA relies on several key components to function effectively: 1. Scalable Object: HPA scales objects like Deployments, ReplicaSets, or Replication Controllers. It cannot scale every Kubernetes object; a Service, for example, is not scalable. 2. Horizontal Pod Autoscaler (HPA) Object: This is a dedicated Kubernetes API resource and controller that defines the auto-scaling behavior. It monitors the metrics and makes the decisions to scale up or down. 3. Metric Server: This is a crucial add-on that must be installed in your cluster. Its primary role is to collect resource metrics (such as CPU and memory utilization) from all running pods. The HPA controller constantly queries the Metric Server for this data.
How HPA Makes Scaling Decisions: * Metric Collection: The Metric Server continuously collects data on CPU utilization (and other configured metrics) for all running pods. * Average Utilization: HPA evaluates the average CPU utilization across all pods of a given scalable object (e.g., a Deployment). For example, if you have two containers, one with 40% load and another with 20% load, the average would be (40+20)/2 = 30%. * Target Comparison: The HPA configuration includes a target CPU utilization percentage (e.g., 20% or 30%). If the calculated average utilization exceeds this target, HPA will trigger a scale-up event. * Pod Creation/Deletion: Based on the average utilization relative to the target, HPA calculates how many new pods are needed to bring the average utilization down to the target. It then instructs the Deployment or ReplicaSet to create (or delete) pods accordingly. For instance, if one pod has 80% CPU utilization and the target is 30%, HPA might create two more pods, so the 80% load can be distributed, aiming for an average closer to the 30% target.
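The scale-up arithmetic follows the standard HPA formula from the Kubernetes documentation, desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization). Applied to the example above, it confirms why one overloaded pod becomes three:
# desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)
# One pod at 80% utilization with a 30% target:
# desiredReplicas = ceil(1 * 80 / 30) = ceil(2.67) = 3
# HPA therefore creates two additional pods, for three in total.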
Scaling Parameters:
When configuring HPA, you define key parameters:
* Target CPU Percentage: The desired average CPU utilization for your pods (e.g., 20% or 30%).
* Minimum Replicas: The minimum number of pods that should always be running, even if the load is very low (e.g., 1 pod). HPA will not scale down below this number.
* Maximum Replicas: The maximum number of pods HPA can create (e.g., 10 pods). This prevents uncontrolled scaling and resource exhaustion.
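For reference, these same parameters can also be expressed declaratively as a HorizontalPodAutoscaler manifest. This is a sketch using the autoscaling/v2 API; the object name is a placeholder, and the target deployment name my-deploy matches the demo later in this guide.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-deploy-hpa            # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-deploy              # the Deployment to scale
  minReplicas: 1                 # never scale below 1 pod
  maxReplicas: 10                # never scale above 10 pods
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 20   # target average CPU utilization (%)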
Downscaling Behavior and Cooling Period: While HPA scales up quickly when load increases, it has a more cautious approach to scaling down. * Cooling Period: To prevent rapid, unnecessary scaling fluctuations (known as "thrashing"), HPA has a default cooling period of 5 minutes before it will delete pods. This means that even if the load drops significantly, HPA will wait for 5 minutes to confirm that the low load persists before downscaling. If the load increases again within this 5-minute window, it avoids deleting pods unnecessarily. * Check Frequency: The HPA controller checks the resource metrics from the Metric Server and re-evaluates scaling decisions every 30 seconds.
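In the autoscaling/v2 API, this cooling period corresponds to the scale-down stabilization window, which you can tune through the behavior field. A sketch, assuming the HorizontalPodAutoscaler manifest shown earlier (300 seconds is the default):
# Added under spec: of the HorizontalPodAutoscaler
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # wait 5 minutes of sustained low load before deleting pods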
Horizontal vs. Vertical Autoscaling: It is important to distinguish between these two types of scaling: * Horizontal Scaling: Involves creating new pods to distribute the load across more instances. This is like adding more rooms to a house when more family members arrive. * Vertical Scaling: Involves increasing the capacity of existing pods (e.g., allocating more CPU or memory to an already running pod). This is like expanding an existing room to accommodate more people. The video uses an analogy of increasing the capacity of an existing 50GB storage to 100GB (vertical) versus adding a completely new 50GB storage (horizontal). HPA focuses on horizontal scaling.
Practical Implementation Steps for HPA#
To demonstrate HPA, you need to set up the Metric Server, create a sample application Deployment, and then configure the HPA object.
1. Setting up the Metric Server: The Metric Server is not installed by default in Kubernetes and is essential for HPA to work.
- Download the Metric Server YAML: You typically download it from its GitHub repository, which is an open-source project.
Command:
wget https://raw.githubusercontent.com/kubernetes-sigs/metrics-server/master/deploy/1.8%2B/metrics-server-deployment.yaml
(Note: The exact URL might vary slightly depending on the Kubernetes version and Metric Server release. The source indicates a GitHub link for download.)
- Modify the Metric Server YAML: After downloading, edit the metrics-server-deployment.yaml file. Inside the Deployment section, find the args (arguments) for the metrics-server container. You must add the argument --kubelet-insecure-tls to bypass certificate validation, since a basic lab environment typically doesn't have properly signed kubelet certificates.
- YAML modification example (inside containers[].args for metrics-server):
# ... (other parts of metrics-server-deployment.yaml)
containers:
- name: metrics-server
  image: k8s.gcr.io/metrics-server/metrics-server:v0.5.0  # Example image
  args:
  - --cert-dir=/tmp
  - --secure-port=4443
  - --kubelet-insecure-tls  # Add this line
  - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
  - --kubelet-use-node-status-port
# ...
- Apply the Metric Server YAML: After modification, apply the YAML file to deploy the Metric Server.
Command: kubectl apply -f metrics-server-deployment.yaml
- Verify Metric Server: Check the pods in the kube-system namespace to ensure the Metric Server pod is running. You can also view its logs to confirm it's collecting metrics.
Commands:
kubectl get pods -n kube-system
kubectl logs -f <metrics-server-pod-name> -n kube-system
(Look for "Generated self-signed certificate" to confirm proper setup.)
2. Creating a Sample Deployment: Next, create a sample application deployment that HPA will manage. This deployment will typically include resource requests and limits.
- Deployment YAML structure example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deploy                # Name of your deployment
spec:
  replicas: 1                    # Start with 1 replica for demonstration
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-container       # Name of your container
        image: httpd:latest      # Example image (Apache server)
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: "200m"          # Request 200 milliCPU
          limits:
            cpu: "500m"          # Limit 500 milliCPU
This YAML defines a deployment named my-deploy with an Apache server running in a container, requesting 200m CPU and limited to 500m CPU. It starts with one replica.
- Apply the Deployment YAML:
Command: kubectl apply -f my-deployment.yaml
- Verify Deployment and Pod:
Command: kubectl get all
(You should see your deployment, replica set, and one running pod.)
3. Configuring the HPA Object: Now, create the HPA object that will monitor and scale your deployment.
- HPA creation command:
Command: kubectl autoscale deployment my-deploy --cpu-percent=20 --min=1 --max=10
- deployment my-deploy: Specifies that HPA should scale the my-deploy deployment.
- --cpu-percent=20: Sets the target average CPU utilization to 20%.
- --min=1: Sets the minimum number of pods to 1.
- --max=10: Sets the maximum number of pods to 10.
- Verify HPA: After running the command, check the HPA status.
Command: kubectl get hpa
(You will see the HPA named my-deploy, its target CPU, min/max pods, and current replicas. Initially, the target CPU might show 0% or a low value if no load is present.)
4. Demonstrating Autoscaling: To see HPA in action, you need to simulate a load increase on your application.
- Open Two Terminals: Keep one terminal for monitoring and another for generating load.
- Monitoring Terminal: In the first terminal, continuously monitor your pods and HPA.
Command: watch kubectl get all
(This command refreshes every 2 seconds, showing changes in pod counts and HPA metrics. Watch the TARGET column for your HPA.)
- Load Generation Terminal: In the second terminal, get a shell into your running application pod.
Command: kubectl exec -it <my-deploy-pod-name> -- /bin/bash
- Generate Load: Inside the pod's shell, run a command that consumes CPU, such as apt update or a simple infinite loop.
Command (example for CPU load): while true; do true; done
(This loop will make the CPU usage of the container spike. Alternatively, if your image has apt, you can run apt update repeatedly.)
- Observe Scaling Up: As you run the load generation command, watch the monitoring terminal. You will see the TARGET CPU percentage of your HPA increase from 0% (or its initial low value). Once it consistently exceeds 20%, you will observe new pods being created automatically. The REPLICAS count in kubectl get all will increase, and new pod entries will appear as Running.
- Observe Scaling Down: Once you stop the load generation (e.g., by pressing Ctrl+C in the load generation terminal), the CPU utilization will drop. Due to the 5-minute cooling period, the extra pods will not be deleted immediately. After approximately 5 minutes, if the load remains low, HPA will automatically delete the excess pods, returning the replica count to the minimum defined value (e.g., 1).
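To review the scaling decisions HPA made during the demo, you can also inspect the autoscaler's events (the HPA name matches the deployment created above):
kubectl describe hpa my-deploy   # The Events section lists each scale-up/scale-down and its reason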
This practical demonstration highlights how Kubernetes, through HPA and Metric Server, can automatically manage application scaling based on real-time resource utilization, making your applications resilient and efficient in handling varying loads.