
Horizontal Pod Autoscaling

Kubernetes offers powerful capabilities for managing resources and automatically scaling applications to meet demand. This overview covers resource quotas for CPU and memory, and Horizontal Pod Autoscaling (HPA), complete with practical considerations for exams and interviews.

Kubernetes Resource Quota: CPU and Memory Management

Kubernetes offers two complementary mechanisms here: a ResourceQuota caps the aggregate CPU and memory that all pods in a namespace may consume, while a LimitRange (which the examples in this section rely on) defines the minimum and maximum resource allocations for individual containers within a pod. Both are crucial for efficient resource utilization and for preventing a single container from consuming excessive resources. A ResourceQuota sketch follows for contrast.
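For contrast, here is a minimal ResourceQuota sketch that caps aggregate usage for a namespace (the object name and values are illustrative):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
spec:
  hard:
    requests.cpu: "4"     # Total CPU all pods in the namespace may request
    requests.memory: 8Gi  # Total memory all pods may request
    limits.cpu: "8"       # Sum of all CPU limits across the namespace
    limits.memory: 16Gi   # Sum of all memory limits across the namespace

The per-container minimum/maximum behavior discussed in the rest of this section is enforced with LimitRange objects.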

Default Resource Ranges for a Container: When defining resources for a container, there are ranges that must be adhered to. These are not hard-coded Kubernetes defaults; they come from the LimitRange configured in the namespace used throughout this section, and they specify the minimum and maximum CPU and memory a container can request or be limited to:

* CPU: A container's minimum CPU request cannot be less than 0.5 CPU, and its maximum CPU limit cannot exceed 1 CPU.
* Memory: A container's minimum memory request cannot be less than 500MB, and its maximum memory limit cannot exceed 1GB.

Request and Limit Behavior: Understanding the interaction between requests and limits is vital.

* When a limit is set but a request is not: If you specify a CPU limit of, say, 500 milliCPU but do not mention the request, the request is automatically set equal to the limit, so the container is allocated 500 milliCPU as its request. Example scenario:

# Example part of a Pod YAML definition
resources:
  limits:
    cpu: "500m" # Limit is 500 milliCPU
  # requests: # Request is not mentioned
In this case, the effective request will also be 500 milliCPU.

* When a request is set but a limit is not: If you specify a CPU request of, say, 600 milliCPU but do not mention the limit, the limit does not automatically equal the request. Instead, it falls back to the namespace default from the LimitRange, which for CPU here is the maximum of 1 CPU. Example scenario:
# Example part of a Pod YAML definition
resources:
  requests:
    cpu: "600m" # Request is 600 milliCPU
  # limits: # Limit is not mentioned
In this case, the effective limit will be 1 CPU.

* When only one resource (e.g., memory) is mentioned: If you define a request and limit only for memory and do not mention CPU at all, CPU picks up the namespace defaults: a CPU request of 0.5 and a CPU limit of 1. A minimal sketch of this case follows.
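A hypothetical pod fragment illustrating the memory-only case, assuming the LimitRange described in this section is active in the namespace (the injected CPU values are shown as comments):

# Example part of a Pod YAML definition
resources:
  requests:
    memory: "600Mi" # Memory request set explicitly
  limits:
    memory: "800Mi" # Memory limit set explicitly
  # CPU is omitted entirely: the namespace LimitRange injects
  # an effective CPU request of 0.5 and a CPU limit of 1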

Memory Resource Quota Examples and Error Scenarios: When defining memory resources, it's critical to stay within the configured ranges.

* Setting a limit range (using a LimitRange object): You can define a LimitRange object to set the default, minimum, and maximum memory for containers within a namespace. Example YAML structure for LimitRange:

apiVersion: v1
kind: LimitRange
metadata:
  name: mem-default-limit-range
spec:
  limits:
  - default:
      memory: 1Gi # Default limit injected when a container omits one
    defaultRequest:
      memory: 500Mi # Default request injected when a container omits one
    max:
      memory: 1Gi # Hard upper bound on any container's memory limit
    min:
      memory: 500Mi # Hard lower bound on any container's memory request
    type: Container
This example injects a default limit of 1GB and a default request of 500MB, and its min/max entries enforce those same values as hard bounds for any container in the namespace.

* Valid Memory Allocation: If you set a request of 600MB and a limit of 800MB for a container, this is valid because 600MB is above the 500MB minimum request and 800MB is below the 1GB maximum limit. Example YAML part:
resources:
  requests:
    memory: "600Mi"
  limits:
    memory: "800Mi"
When applied, this pod is created successfully, and kubectl describe pod <pod-name> will show both the 600MB request and the 800MB limit.

* Error Scenario: Limit Exceeds Maximum: If you try to set a memory limit greater than 1GB (e.g., 1200MB), Kubernetes rejects the pod, stating that the maximum memory limit per container is 1GB. Example YAML part leading to the error:
resources:
  limits:
    memory: "1200Mi" # This will cause an error
The command kubectl apply -f mem-too-high-limit.yaml would fail with an error similar to: "maximum memory usage per Container is 1Gi, but limit is 1200Mi".

* Error Scenario: Request Below Minimum: Attempting to set a memory request below 500MB (e.g., 300MB) also fails, with an error indicating that the minimum memory request per container is 500MB. Example YAML part leading to the error:
resources:
  requests:
    memory: "300Mi" # This will cause an error
The command kubectl apply -f mem-too-low-request.yaml would fail with an error similar to: "minimum memory usage per Container is 500Mi, but request is 300Mi".

These constraints ensure that containers operate within defined boundaries, preventing resource hogging and promoting cluster stability. Note that such ranges apply only in a namespace where a LimitRange has been explicitly created; they are not cluster-wide, and the default namespace has none unless you add one. You can inspect what is active, as shown below.
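To verify what is active in a namespace, two standard commands (shown here against the LimitRange defined above) are useful:

kubectl get limitrange
kubectl describe limitrange mem-default-limit-range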

Horizontal Pod Autoscaling (HPA)

Horizontal Pod Autoscaling (HPA) is a Kubernetes feature that automatically scales the number of pods in a Deployment or ReplicaSet based on observed CPU utilization or other custom metrics. It is a critical component for achieving automation in modern DevOps environments, as manual scaling is inefficient and impractical for fluctuating loads.

Concept and Need for HPA: Imagine an application like Hotstar during a major cricket match, experiencing over 1.2 crore (12 million) viewers. Without auto-scaling, handling such sudden and massive increases in user traffic would require manual intervention, leading to potential service disruptions. HPA addresses this by allowing your application to automatically create new pods (horizontal scaling) when the load increases, and reduce pods when the load decreases.

Core Components of HPA: HPA relies on several key components to function effectively (a quick way to query the metrics API follows this list):

  1. Scalable Object: HPA scales objects such as Deployments, ReplicaSets, and Replication Controllers. It cannot scale every Kubernetes object; Services, for example, are not scalable targets.
  2. HorizontalPodAutoscaler (HPA) Object: A dedicated Kubernetes API resource and controller that defines the auto-scaling behavior. It monitors the metrics and makes the decisions to scale up or down.
  3. Metric Server: A crucial add-on that must be installed in your cluster. Its primary role is to collect resource metrics (CPU and memory utilization) from all running pods. The HPA controller regularly queries the Metric Server for this data.
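Once the Metric Server is installed (see the practical steps below), the resource metrics it serves are exposed through the metrics.k8s.io API group, and you can query them directly to see exactly what the HPA controller consumes:

kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/pods"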

How HPA Makes Scaling Decisions:

* Metric Collection: The Metric Server continuously collects data on CPU utilization (and other configured metrics) for all running pods.
* Average Utilization: HPA evaluates the average CPU utilization across all pods of a given scalable object (e.g., a Deployment). For example, with two containers at 40% and 20% load, the average is (40+20)/2 = 30%.
* Target Comparison: The HPA configuration includes a target CPU utilization percentage (e.g., 20% or 30%). If the calculated average exceeds this target, HPA triggers a scale-up.
* Pod Creation/Deletion: Based on the average utilization relative to the target, HPA calculates how many pods are needed to bring the average utilization back down to the target, then instructs the Deployment or ReplicaSet to create (or delete) pods accordingly. For instance, if one pod runs at 80% CPU utilization against a 30% target, HPA creates two more pods so the load can be distributed; the arithmetic is worked through below.
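The replica count follows the formula documented for the upstream HPA controller:

desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)

Applied to the example above: ceil(1 * 80 / 30) = ceil(2.67) = 3, so one pod at 80% utilization against a 30% target scales to three replicas, i.e., two new pods are created.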

Scaling Parameters: When configuring HPA, you define three key parameters:

* Target CPU Percentage: The desired average CPU utilization for your pods (e.g., 20% or 30%).
* Minimum Replicas: The minimum number of pods that should always be running, even when load is very low (e.g., 1 pod). HPA will not scale down below this number.
* Maximum Replicas: The maximum number of pods HPA can create (e.g., 10 pods). This prevents uncontrolled scaling and resource exhaustion.

Downscaling Behavior and Cooling Period: While HPA scales up quickly when load increases, it takes a more cautious approach to scaling down.

* Cooling Period: To prevent rapid, unnecessary scaling fluctuations (known as "thrashing"), HPA waits through a cooling period, 5 minutes by default, before deleting pods. Even if the load drops significantly, HPA waits to confirm that the low load persists before downscaling; if the load rises again within that window, pods are not deleted unnecessarily.
* Check Frequency: The HPA controller re-reads metrics from the Metric Server and re-evaluates its scaling decision on a fixed sync interval, 15 seconds by default in current Kubernetes releases (older material often cites 30 seconds). Both intervals are configurable, as shown below.
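Both intervals are kube-controller-manager flags; the defaults below are taken from the upstream Kubernetes documentation:

--horizontal-pod-autoscaler-downscale-stabilization=5m0s  # cooling period before scale-down
--horizontal-pod-autoscaler-sync-period=15s               # how often metrics are re-evaluated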

Horizontal vs. Vertical Autoscaling: It is important to distinguish between these two types of scaling:

* Horizontal Scaling: Creating new pods to distribute the load across more instances. This is like adding more rooms to a house when more family members arrive.
* Vertical Scaling: Increasing the capacity of existing pods (e.g., allocating more CPU or memory to an already running pod). This is like expanding an existing room to accommodate more people.

The video's storage analogy makes the same point: growing an existing 50GB volume to 100GB is vertical scaling, while adding a separate new 50GB volume is horizontal scaling. HPA focuses on horizontal scaling.

Practical Implementation Steps for HPA

To demonstrate HPA, you need to set up the Metric Server, create a sample application Deployment, and then configure the HPA object.

  1. Setting up the Metric Server: The Metric Server is not installed by default in Kubernetes and is essential for HPA to work.

    • Download the Metric Server YAML: The Metric Server is an open-source project hosted on GitHub. In current releases the recommended manifest is the components.yaml attached to each release: wget https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml (Older tutorials, including the source here, reference per-version paths such as https://raw.githubusercontent.com/kubernetes-sigs/metrics-server/master/deploy/1.8%2B/metrics-server-deployment.yaml; the exact URL varies with the Kubernetes version and Metric Server release.)
    • Modify the Metric Server YAML: After downloading, edit the metrics-server-deployment.yaml file. Inside the Deployment section, find the args (arguments) for the metrics-server container and add --kubelet-insecure-tls. This tells the Metric Server to skip verifying the kubelets' serving certificates, which is needed in a basic lab environment where the kubelets typically present self-signed certificates that cannot otherwise be validated.
      • YAML modification example (inside containers[].args for metrics-server):
        # ... (other parts of metrics-server-deployment.yaml)
        containers:
        - name: metrics-server
          image: k8s.gcr.io/metrics-server/metrics-server:v0.5.0 # Example image
          args:
            - --cert-dir=/tmp
            - --secure-port=4443
            - --kubelet-insecure-tls # Add this line
            - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
            - --kubelet-use-node-status-port
        # ...
        
    • Apply the Metric Server YAML: After modification, apply the YAML file to deploy the Metric Server. Command: kubectl apply -f metrics-server-deployment.yaml.
    • Verify Metric Server: You can check the pods in the kube-system namespace to ensure the Metric Server pod is running. You can also view its logs to confirm it's collecting metrics. Commands:
      • kubectl get pods -n kube-system
      • kubectl logs -f <metrics-server-pod-name> -n kube-system (Look for "Generated self-signed certificate" to confirm proper setup).
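    • Confirm metrics are flowing: once the Metric Server is up (this can take a minute), kubectl top should return live usage data. Commands:
      • kubectl top nodes
      • kubectl top pods -A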
  2. Creating a Sample Deployment: Next, create a sample application deployment that HPA will manage. This deployment will typically include resource requests and limits.

    • Deployment YAML structure example:
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: my-deploy # Name of your deployment
      spec:
        replicas: 1 # Start with 1 replica for demonstration
        selector:
          matchLabels:
            app: my-app
        template:
          metadata:
            labels:
              app: my-app
          spec:
            containers:
            - name: my-container # Name of your container
              image: httpd:latest # Example image (Apache server)
              ports:
              - containerPort: 80
              resources:
                requests:
                  cpu: "200m" # Request 200 milliCPU
                limits:
                  cpu: "500m" # Limit 500 milliCPU
      
      This YAML defines a deployment named my-deploy with an Apache server running in a container, requesting 200m CPU and limited to 500m CPU. It starts with one replica.
    • Apply the Deployment YAML: Command: kubectl apply -f my-deployment.yaml.
    • Verify Deployment and Pod: Command: kubectl get all (You should see your deployment, replica set, and one running pod).
  3. Configuring the HPA Object: Now, create the HPA object that will monitor and scale your deployment.

    • Create the HPA object: Command: kubectl autoscale deployment my-deploy --cpu-percent=20 --min=1 --max=10.
      • deployment my-deploy: Specifies that HPA should scale the my-deploy deployment.
      • --cpu-percent=20: Sets the target average CPU utilization to 20%.
      • --min=1: Sets the minimum number of pods to 1.
      • --max=10: Sets the maximum number of pods to 10.
    • Verify HPA: After running the command, check the HPA status. Command: kubectl get hpa (You will see the HPA named my-deploy, its target CPU, min/max pods, and current replicas). Initially the target may read 0%, a low value, or <unknown> while no load or no metric data is present. A declarative equivalent of the command above is sketched below.
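    • Declarative alternative: a sketch using the autoscaling/v2 API; the kubectl autoscale command above produces roughly the same object. Apply it with kubectl apply -f my-hpa.yaml if you prefer manifests:
      apiVersion: autoscaling/v2
      kind: HorizontalPodAutoscaler
      metadata:
        name: my-deploy
      spec:
        scaleTargetRef:
          apiVersion: apps/v1
          kind: Deployment
          name: my-deploy # The Deployment created in step 2
        minReplicas: 1
        maxReplicas: 10
        metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 20 # Target average CPU utilization (%)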
  4. Demonstrating Autoscaling: To see HPA in action, you need to simulate a load increase on your application.

    • Open Two Terminals: Keep one terminal for monitoring and another for generating load.
    • Monitoring Terminal: In the first terminal, continuously monitor your pods and HPA. Command: watch kubectl get all (This command will refresh every 2 seconds, showing changes in pod counts and HPA metrics). You will observe the TARGET column for your HPA.
    • Load Generation Terminal: In the second terminal, get a shell into your running application pod. Command: kubectl exec -it <my-deploy-pod-name> -- /bin/bash.
    • Generate Load: Inside the pod's shell, run a command that consumes CPU, such as an apt update or a simple infinite loop. Command (example for CPU load): while true; do true; done (This command will make the CPU usage of the container spike). Alternatively, if your image has apt, you can run apt update repeatedly.
    • Observe Scaling Up: As you run the load generation command, watch the monitoring terminal. You will see the TARGET CPU percentage of your HPA increase from 0% (or its initial low value). Once it consistently exceeds 20%, you will observe new pods being created automatically. The REPLICAS count in kubectl get all will increase, and new pod entries will appear as Running.
    • Observe Scaling Down: Once you stop the load generation (e.g., by pressing Ctrl+C in the load generation terminal), the CPU utilization will drop. Due to the 5-minute cooling period, the extra pods will not be deleted immediately. After approximately 5 minutes, if the load remains low, HPA will automatically delete the excess pods, returning the replica count closer to the minimum defined value (e.g., 1).
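    • Inspect scaling events: kubectl describe hpa my-deploy lists the autoscaler's current and target metrics plus its recent events, which is useful for confirming exactly why each scale-up or scale-down happened.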

This practical demonstration highlights how Kubernetes, through HPA and Metric Server, can automatically manage application scaling based on real-time resource utilization, making your applications resilient and efficient in handling varying loads.