Mastering Application Health Checks and Readiness Probes

[!IMPORTANT] For the AWS Certified Developer - Associate (DVA-C02) exam, understanding health checks is critical for ensuring high availability and automating the replacement of failing application components.

Learning Objectives

By the end of this guide, you will be able to:

Differentiate between Liveness and Readiness concepts in AWS environments.
Configure Application Load Balancer (ALB) and Network Load Balancer (NLB) target group health checks.
Implement a robust health check endpoint within your application code.
Define the relationship between Auto Scaling Group (ASG) health checks and Load Balancer health checks.
Configure Route 53 health checks for DNS-level failover.

Key Terms & Glossary

Health Check: A periodic request sent by an AWS service (like ELB) to an application instance to verify its status.
Readiness Probe: A check to determine if an application is ready to accept traffic (e.g., after it has finished loading a cache or connecting to a database).
Grace Period: The amount of time an Auto Scaling group waits before checking the health status of a newly launched instance.
Healthy/Unhealthy Threshold: The number of consecutive successful or failed probes required to change an instance's status.
Deep Health Check: A health check that validates not just the web server, but also downstream dependencies like databases or external APIs.
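To make the threshold definition concrete, the following is a minimal sketch of how consecutive probe results could drive a status change. The class name `TargetHealthTracker` and its fields are hypothetical and only model the concept; they are not an AWS API.

```python
from dataclasses import dataclass


@dataclass
class TargetHealthTracker:
    """Models one target's status using consecutive-probe thresholds (illustrative only)."""

    healthy_threshold: int = 5
    unhealthy_threshold: int = 2
    status: str = "unhealthy"
    _streak: int = 0  # consecutive probes that disagree with the current status

    def record(self, probe_succeeded: bool) -> str:
        """Feed in one probe result and return the (possibly updated) status."""
        agrees = probe_succeeded == (self.status == "healthy")
        if agrees:
            # A probe that matches the current status resets the opposing streak.
            self._streak = 0
            return self.status
        self._streak += 1
        needed = self.healthy_threshold if probe_succeeded else self.unhealthy_threshold
        if self._streak >= needed:
            self.status = "healthy" if probe_succeeded else "unhealthy"
            self._streak = 0
        return self.status
```

With `healthy_threshold=3`, a target in this model needs three successes in a row to become healthy, and a single success in the middle of a failure run resets the failure count.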

The "Big Idea"

In a distributed system, individual components will inevitably fail. Health checks act as the "system pulse." Without them, a Load Balancer would blindly send traffic to a "zombie" instance—one that is running but unable to process requests—leading to user-facing errors. By effectively configuring health checks, you enable AWS to automatically route traffic away from failing components and trigger self-healing mechanisms.

Formula / Concept Box

| Parameter | Standard Default (ALB) | Purpose |
| --- | --- | --- |
| Path | `/` | The destination URL for the health check (e.g., `/health`). |
| Port | `traffic-port` | The port the load balancer uses to send health checks. |
| Healthy Threshold | 5 | Consecutive successes to mark as "Healthy". |
| Unhealthy Threshold | 2 | Consecutive failures to mark as "Unhealthy". |
| Interval | 30 seconds | Time between individual health check probes. |
| Timeout | 5 seconds | Time to wait for a response before counting as a failure. |
| Success Codes | `200` | The HTTP status code(s) indicating a healthy response. |
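As a rough worked calculation, the interval and thresholds together determine how long a status change takes. The sketch below multiplies the two; it is an approximation that ignores the timeout and where in the interval a failure begins, and the function name is hypothetical.

```python
def seconds_to_change_status(interval_seconds: int, threshold: int) -> int:
    """Approximate time for `threshold` consecutive probes spaced `interval_seconds` apart."""
    return interval_seconds * threshold


# ALB defaults taken from the table above.
INTERVAL = 30
HEALTHY_THRESHOLD = 5
UNHEALTHY_THRESHOLD = 2

time_to_unhealthy = seconds_to_change_status(INTERVAL, UNHEALTHY_THRESHOLD)
time_to_healthy = seconds_to_change_status(INTERVAL, HEALTHY_THRESHOLD)

print(f"Roughly {time_to_unhealthy}s to mark a failing target unhealthy")
print(f"Roughly {time_to_healthy}s to mark a recovering target healthy")
```

With the defaults, that is about 60 seconds to detect a failure and about 150 seconds to bring a target back into service, which is why lowering the interval or thresholds speeds up failover at the cost of more probe traffic.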

Hierarchical Outline

Elastic Load Balancing (ELB) Health Checks
- Target Groups: Health checks are configured at the Target Group level, not the Load Balancer level.
- Status Codes: You can specify a range (e.g., 200-399) for flexibility.
Auto Scaling Group (ASG) Integration
- EC2 Status Checks: Default check; detects only instance- and system-level failures, not whether the application is responding.
- ELB Health Checks: Must be enabled manually. If the ELB marks an instance unhealthy, the ASG terminates and replaces it.
Application-Level Implementation
- Shallow Checks: Verifies only the web server is responding (e.g., static file).
- Deep Checks: Verifies DB connectivity, memory usage, and background thread status.
Route 53 Health Checks
- Public Endpoints: Monitors public-facing IP addresses or domain names.
- CloudWatch Alarms: Can trigger DNS failover based on alarm status.
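For the Route 53 item in the outline, a health check request might be assembled as sketched below. The helper name, the domain name, and the chosen interval and failure threshold are assumptions for illustration; the dictionary keys follow the Route 53 `CreateHealthCheck` request shape.

```python
import uuid


def build_route53_health_check(domain_name, path="/health", search_string=None):
    """Build keyword arguments for a Route 53 create_health_check call (illustrative)."""
    config = {
        # The *_STR_MATCH types also inspect the response body for SearchString.
        "Type": "HTTPS_STR_MATCH" if search_string else "HTTPS",
        "FullyQualifiedDomainName": domain_name,
        "Port": 443,
        "ResourcePath": path,
        "RequestInterval": 30,   # seconds between probes (assumed value)
        "FailureThreshold": 3,   # consecutive failures before unhealthy (assumed value)
    }
    if search_string:
        config["SearchString"] = search_string
    # CallerReference must be unique per request so retries are idempotent.
    return {"CallerReference": str(uuid.uuid4()), "HealthCheckConfig": config}


request = build_route53_health_check("app.example.com")  # hypothetical domain
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("route53").create_health_check(**request)
```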

Visual Anchors

Health Check Lifecycle

[Diagram not rendered: health check lifecycle]

Load Balancer vs. Target Health

[Diagram not rendered: load balancer vs. target health]

Definition-Example Pairs

Shallow Health Check: A simple check of the web server availability.
- Example: Checking if index.html loads. It is fast but doesn't guarantee the app can actually talk to the database.
Readiness Probe: A check that ensures the application is fully initialized.
- Example: A Java Spring Boot application that must wait for its Hibernate connection pool to initialize before it returns a 200 OK on /ready.
Health Check Grace Period: A delay after launch during which the ASG does not act on failed health checks for a new instance.
- Example: If your app takes 5 minutes to start, set the grace period to 300 seconds so the ASG doesn't kill it for being "unhealthy" while it's still booting up.
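The difference between a shallow check and a readiness probe can be sketched with Python's standard library alone. In this sketch, the `/health` and `/ready` paths, the `app_ready` flag, and the `start_probe_server` helper are all hypothetical names chosen for illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Set once startup work (cache load, connection pool, etc.) has finished.
app_ready = threading.Event()


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            status = 200  # shallow: the process is up and serving HTTP
        elif self.path == "/ready":
            status = 200 if app_ready.is_set() else 503  # readiness: init finished?
        else:
            status = 404
        body = b"ok" if status == 200 else b"unavailable"
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the console


def start_probe_server(host: str = "127.0.0.1", port: int = 8080) -> HTTPServer:
    """Serve the probe endpoints on a background thread and return the server."""
    server = HTTPServer((host, port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_probe_server()` and later `app_ready.set()` flips `/ready` from 503 to 200, while `/health` answers 200 the whole time. Pointing the load balancer at the readiness path keeps traffic away until initialization completes.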

Worked Examples

Scenario 1: Configuring an ALB Health Check Path

You have a Python Flask application. You need to ensure the ALB only sends traffic if the database connection is alive.

Application Code:

python

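from sqlalchemy import text  # SQLAlchemy 2.x requires text() for raw SQL strings

@app.route('/health')
def health_check():
    """Deep health check: return 200 only if the database answers a trivial query."""
    try:
        # Assumes `app` (Flask) and `db` (Flask-SQLAlchemy) are defined elsewhere.
        db.session.execute(text('SELECT 1'))
        return "Healthy", 200
    except Exception:
        # Any database failure marks this target unhealthy for the ALB.
        return "Service Unavailable", 503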

AWS Configuration:
- Navigate to Target Groups in the EC2 Console.
- Select your target group -> Health checks tab -> Edit.
- Health check path: /health.
- Success codes: 200.
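The same console settings could be applied programmatically. The sketch below assumes the caller passes in an ELBv2 client such as `boto3.client("elbv2")`; the function name and the target group ARN are hypothetical placeholders.

```python
def configure_health_check(elbv2_client, target_group_arn, path="/health", success_codes="200"):
    """Point a target group's health check at `path` and set the accepted status codes."""
    return elbv2_client.modify_target_group(
        TargetGroupArn=target_group_arn,
        HealthCheckPath=path,
        # Matcher accepts a single code, a list ("200,202"), or a range ("200-399").
        Matcher={"HttpCode": success_codes},
    )


# Example usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   configure_health_check(boto3.client("elbv2"), "<your-target-group-arn>")
```

Passing the client in as a parameter keeps the function easy to exercise with a stub in tests, without calling AWS.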

Scenario 2: ASG Integration

You notice that even though your ALB shows instances as "Unhealthy," the Auto Scaling Group is not replacing them.

Fix: By default, ASG only uses EC2 status checks (hardware/system level). You must go to the ASG settings and change the Health Check Type to ELB. This allows the ASG to use the ALB's granular application-level health check results to trigger instance replacement.
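A scripted version of this fix might look like the sketch below. It assumes the caller supplies an Auto Scaling client such as `boto3.client("autoscaling")`; the function name, group name, and 300-second grace period are illustrative choices.

```python
def enable_elb_health_checks(autoscaling_client, group_name, grace_period_seconds=300):
    """Switch an Auto Scaling group from EC2-only checks to ELB health checks."""
    return autoscaling_client.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        HealthCheckType="ELB",
        # Give new instances time to boot before failed checks trigger replacement.
        HealthCheckGracePeriod=grace_period_seconds,
    )


# Example usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   enable_elb_health_checks(boto3.client("autoscaling"), "my-asg")  # hypothetical group name
```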

Checkpoint Questions

What happens to requests when an instance becomes "Unhealthy"? (Answer: The ALB stops routing new requests to the unhealthy target. If the target is then deregistered, for example when an ASG replaces it, the Deregistration Delay, known as Connection Draining on Classic Load Balancers, gives in-flight requests time to complete.)
If your application takes 2 minutes to download dependencies during startup, what ASG setting must you adjust? (Answer: Increase the Health Check Grace Period).
True or False: Route 53 can use an alias record to evaluate the health of an ALB. (Answer: True, this is called "Evaluate Target Health").
Can a health check be configured to look for a specific string in the response body? (Answer: Not for ELB; its HTTP health checks evaluate only the status code. Route 53 health checks, by contrast, support string matching on the response body.)