Mastering Application Health Checks and Readiness Probes
Configure application health checks and readiness probes
[!IMPORTANT] For the AWS Certified Developer - Associate (DVA-C02) exam, understanding health checks is critical for ensuring high availability and automating the replacement of failing application components.
Learning Objectives
By the end of this guide, you will be able to:
- Differentiate between Liveness and Readiness concepts in AWS environments.
- Configure Application Load Balancer (ALB) and Network Load Balancer (NLB) target group health checks.
- Implement a robust health check endpoint within your application code.
- Define the relationship between Auto Scaling Group (ASG) health checks and Load Balancer health checks.
- Configure Route 53 health checks for DNS-level failover.
Key Terms & Glossary
- Health Check: A periodic request sent by an AWS service (like ELB) to an application instance to verify its status.
- Readiness Probe: A check to determine if an application is ready to accept traffic (e.g., after it has finished loading a cache or connecting to a database).
- Grace Period: The amount of time an Auto Scaling group waits before checking the health status of a newly launched instance.
- Healthy/Unhealthy Threshold: The number of consecutive successful or failed probes required to change an instance's status.
- Deep Health Check: A health check that validates not just the web server, but also downstream dependencies like databases or external APIs.
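The threshold entry above describes a small state machine, which may be easier to follow in code. The sketch below is only an illustration of the consecutive-probe idea: the `TargetHealth` class is hypothetical, and the assumption that a new target starts out as not yet healthy is a simplification rather than a statement of how any AWS service is implemented.

```python
from dataclasses import dataclass


@dataclass
class TargetHealth:
    """Tracks one target's status using consecutive-probe thresholds."""
    healthy_threshold: int = 5
    unhealthy_threshold: int = 2
    healthy: bool = False  # assumption: a new target starts out not yet healthy
    streak: int = 0        # consecutive probes that disagree with the current status

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result and return the resulting status."""
        if probe_ok == self.healthy:
            # The probe agrees with the current status, so any opposing streak resets.
            self.streak = 0
            return self.healthy
        self.streak += 1
        needed = self.healthy_threshold if probe_ok else self.unhealthy_threshold
        if self.streak >= needed:
            # Enough consecutive opposing probes: flip the status.
            self.healthy = probe_ok
            self.streak = 0
        return self.healthy


target = TargetHealth()
history = [target.record(True) for _ in range(5)]
print(history)  # [False, False, False, False, True]
```

A single failed probe after that point leaves the target healthy; only a second consecutive failure flips it, which is why one slow response does not remove an instance from rotation.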
The "Big Idea"
In a distributed system, individual components will inevitably fail. Health checks act as the "system pulse." Without them, a Load Balancer would blindly send traffic to a "zombie" instance—one that is running but unable to process requests—leading to user-facing errors. By effectively configuring health checks, you enable AWS to automatically route traffic away from failing components and trigger self-healing mechanisms.
Formula / Concept Box
| Parameter | Standard Default (ALB) | Purpose |
|---|---|---|
| Path | / | The destination URL for the health check (e.g., /health). |
| Port | traffic-port | The port the load balancer uses to send health checks. |
| Healthy Threshold | 5 | Consecutive successes to mark as "Healthy". |
| Unhealthy Threshold | 2 | Consecutive failures to mark as "Unhealthy". |
| Interval | 30 seconds | Time between individual health check probes. |
| Timeout | 5 seconds | Time to wait for a response before counting as a failure. |
| Success Codes | 200 | The HTTP status code(s) indicating a healthy response. |
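The interval and threshold values in the table combine into a rough detection time. The following is a back-of-the-envelope sketch, not an exact model: it ignores the timeout and where in the interval a failure actually starts, and the function name is made up for illustration.

```python
def time_to_status_change(interval_s: int, threshold: int) -> int:
    """Approximate seconds of consecutive probes needed to flip a target's status."""
    return interval_s * threshold


# Values taken from the ALB defaults in the table above.
to_unhealthy = time_to_status_change(interval_s=30, threshold=2)
to_healthy = time_to_status_change(interval_s=30, threshold=5)

print(to_unhealthy)  # 60
print(to_healthy)    # 150
```

With the defaults, a failing target keeps receiving traffic for roughly a minute, and a recovered target waits roughly two and a half minutes before it is used again. Lowering the interval or thresholds shortens both, at the cost of more probe traffic and more sensitivity to transient errors.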
Hierarchical Outline
- Elastic Load Balancing (ELB) Health Checks
  - Target Groups: Health checks are configured at the Target Group level, not the Load Balancer level.
  - Status Codes: You can specify a range (e.g., `200-399`) for flexibility.
- Auto Scaling Group (ASG) Integration
  - EC2 Status Checks: Default check; only sees if the VM is up.
  - ELB Health Checks: Must be enabled manually. If the ELB marks an instance unhealthy, the ASG terminates and replaces it.
- Application-Level Implementation
  - Shallow Checks: Verifies only the web server is responding (e.g., static file).
  - Deep Checks: Verifies DB connectivity, memory usage, and background thread status.
- Route 53 Health Checks
  - Public Endpoints: Monitors public-facing IP addresses or domain names.
  - CloudWatch Alarms: Can trigger DNS failover based on alarm status.
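As a concrete companion to the Route 53 items in the outline, the sketch below builds the request for a public-endpoint health check. The helper name, the domain, and the chosen interval and threshold are illustrative assumptions; the dictionary keys follow the boto3 `create_health_check` request shape. The actual API call is left in a comment so the block runs without AWS credentials.

```python
import uuid
from typing import Optional


def build_route53_health_check(domain: str, path: str = "/health",
                               search_string: Optional[str] = None) -> dict:
    """Build keyword arguments for a Route 53 HTTPS health check (illustrative values)."""
    config = {
        # The *_STR_MATCH types also require the response body to contain SearchString.
        "Type": "HTTPS_STR_MATCH" if search_string else "HTTPS",
        "FullyQualifiedDomainName": domain,
        "Port": 443,
        "ResourcePath": path,
        "RequestInterval": 30,   # seconds between probes from each checker
        "FailureThreshold": 3,   # consecutive failures before the endpoint is unhealthy
    }
    if search_string:
        config["SearchString"] = search_string
    # CallerReference must be unique per request so retries are not treated as duplicates.
    return {"CallerReference": str(uuid.uuid4()), "HealthCheckConfig": config}


request = build_route53_health_check("app.example.com")  # hypothetical domain
print(request["HealthCheckConfig"]["Type"])  # HTTPS

# With credentials configured, the request could be sent like this:
#   import boto3
#   boto3.client("route53").create_health_check(**request)
```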
Visual Anchors
Health Check Lifecycle
Load Balancer vs. Target Health
\begin{tikzpicture}[node distance=2cm, every node/.style={draw, rectangle, align=center, minimum width=2.5cm}]
\node (User) [draw=none] {User};
\node (ALB) [right of=User, xshift=1cm] {ALB};
\node (H) [right of=ALB, yshift=1cm, xshift=1cm, fill=green!20] {Target A\\(Healthy)};
\node (U) [right of=ALB, yshift=-1cm, xshift=1cm, fill=red!20] {Target B\\(Unhealthy)};
\draw[->, thick] (User) -- (ALB);
\draw[->, thick] (ALB) -- (H) node[midway, above, sloped, draw=none] {Traffic OK};
\draw[->, dashed, red] (ALB) -- (U) node[midway, below, sloped, draw=none] {Blocked};
\draw[<->, blue] (ALB) edge[bend right=45] node[right, draw=none] {Health Probe} (U);
\end{tikzpicture}
Definition-Example Pairs
- Shallow Health Check: A simple check of the web server availability.
  - Example: Checking if `index.html` loads. It is fast but doesn't guarantee the app can actually talk to the database.
- Readiness Probe: A check that ensures the application is fully initialized.
  - Example: A Java Spring Boot application that must wait for its Hibernate connection pool to initialize before it returns a `200 OK` on `/ready`.
- Health Check Grace Period: A delay before the ASG starts killing instances.
  - Example: If your app takes 5 minutes to start, set the grace period to 300 seconds so the ASG doesn't kill it for being "unhealthy" while it's still booting up.
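The readiness idea above can be sketched with nothing but the standard library. This is a minimal illustration under assumptions, not the Spring Boot mechanism mentioned in the example: the `Readiness` class, the `ready_app` function, and the `/ready` path handling are hypothetical names chosen for this sketch.

```python
import threading


class Readiness:
    """Thread-safe flag that flips once startup work has finished."""

    def __init__(self) -> None:
        self._ready = threading.Event()

    def mark_ready(self) -> None:
        self._ready.set()

    def is_ready(self) -> bool:
        return self._ready.is_set()


readiness = Readiness()


def ready_app(environ, start_response):
    """Minimal WSGI app: /ready returns 200 only after mark_ready() has been called."""
    headers = [("Content-Type", "text/plain")]
    if environ.get("PATH_INFO") != "/ready":
        start_response("404 Not Found", headers)
        return [b"not found"]
    if readiness.is_ready():
        start_response("200 OK", headers)
        return [b"ready"]
    # Still initializing: a 503 keeps the load balancer from routing traffic here.
    start_response("503 Service Unavailable", headers)
    return [b"starting"]
```

The startup code would call `readiness.mark_ready()` once caches are loaded and connection pools are open. For local experiments the app can be served with `wsgiref.simple_server.make_server("", 8080, ready_app).serve_forever()`.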
Worked Examples
Scenario 1: Configuring an ALB Health Check Path
You have a Python Flask application. You need to ensure the ALB only sends traffic if the database connection is alive.
- Application Code:

      from sqlalchemy import text

      @app.route('/health')
      def health_check():
          try:
              # Cheap round trip that proves the database connection is alive.
              db.session.execute(text('SELECT 1'))
              return "Healthy", 200
          except Exception:
              # Any non-200 response counts as a failed probe on the ALB.
              return "Service Unavailable", 503

- AWS Configuration:
  - Navigate to Target Groups in the EC2 Console.
  - Select your target group -> Health checks tab -> Edit.
  - Health check path: `/health`.
  - Success codes: `200`.
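The console steps in Scenario 1 can also be expressed through the SDK. The sketch below assumes boto3 and a placeholder target group ARN; the helper names are hypothetical, while the parameter names follow the ELBv2 `modify_target_group` request. The settings are built in a plain function so they can be inspected without AWS credentials.

```python
def build_health_check_settings(target_group_arn: str) -> dict:
    """Keyword arguments mirroring the Scenario 1 health check (ALB defaults otherwise)."""
    return {
        "TargetGroupArn": target_group_arn,
        "HealthCheckProtocol": "HTTP",
        "HealthCheckPort": "traffic-port",
        "HealthCheckPath": "/health",
        "HealthCheckIntervalSeconds": 30,
        "HealthCheckTimeoutSeconds": 5,
        "HealthyThresholdCount": 5,
        "UnhealthyThresholdCount": 2,
        "Matcher": {"HttpCode": "200"},
    }


def apply_health_check(target_group_arn: str) -> None:
    """Send the settings to AWS; requires boto3 and valid credentials."""
    import boto3  # imported lazily so the builder above works without boto3 installed

    client = boto3.client("elbv2")
    client.modify_target_group(**build_health_check_settings(target_group_arn))


settings = build_health_check_settings("arn:aws:elasticloadbalancing:placeholder")
print(settings["HealthCheckPath"])  # /health
```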
Scenario 2: ASG Integration
You notice that even though your ALB shows instances as "Unhealthy," the Auto Scaling Group is not replacing them.
- Fix: By default, ASG only uses EC2 status checks (hardware/system level). You must go to the ASG settings and change the Health Check Type to `ELB`. This allows the ASG to use the ALB's granular application-level health check results to trigger instance replacement.
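The same fix can be applied through the SDK instead of the console. This is a sketch assuming boto3 and a hypothetical group name `my-app-asg`; the 300-second grace period is an illustrative value, and the parameter names follow the Auto Scaling `update_auto_scaling_group` request.

```python
def build_asg_health_settings(asg_name: str, grace_period_s: int = 300) -> dict:
    """Keyword arguments that switch an ASG to load balancer health checks."""
    return {
        "AutoScalingGroupName": asg_name,
        "HealthCheckType": "ELB",                  # use target group health, not only EC2 status
        "HealthCheckGracePeriod": grace_period_s,  # seconds to wait after launch before acting
    }


def apply_asg_health_settings(asg_name: str, grace_period_s: int = 300) -> None:
    """Send the update to AWS; requires boto3 and valid credentials."""
    import boto3  # imported lazily so the builder above works without boto3 installed

    client = boto3.client("autoscaling")
    client.update_auto_scaling_group(**build_asg_health_settings(asg_name, grace_period_s))


print(build_asg_health_settings("my-app-asg")["HealthCheckType"])  # ELB
```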
Checkpoint Questions
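- What happens to in-flight requests when an instance becomes "Unhealthy" and is replaced? (Answer: The ALB stops sending new requests to the unhealthy target. When the target is then deregistered, for example because the ASG is replacing it, the Deregistration Delay, also called Connection Draining, gives in-flight requests time to complete).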
- If your application takes 2 minutes to download dependencies during startup, what ASG setting must you adjust? (Answer: Increase the Health Check Grace Period).
- True or False: Route 53 can use an alias record to evaluate the health of an ALB. (Answer: True, this is called "Evaluate Target Health").
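- Can an ELB health check be configured to look for a specific string in the response body? (Answer: No, ELB health checks for HTTP/HTTPS only look at the HTTP status code. Route 53 health checks, by contrast, do support string matching on the response body).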