7 Key Metrics Tracked by Performance and Availability Monitoring Tools (2026)
Beyond Up or Down
Availability might seem binary—your service is either accessible or it isn't. In practice, availability exists on a spectrum. A website that technically responds but takes 30 seconds to load isn't meaningfully "up" for most purposes. Performance and availability tools track multiple metrics to provide a complete picture of service health.
Uptime Percentage
The most fundamental availability metric: what portion of time was the service accessible?
Calculation
Uptime percentage = (Total time - Downtime) / Total time × 100
If a service experienced 30 minutes of downtime over a 30-day month (43,200 minutes total):
(43,200 - 30) / 43,200 × 100 = 99.93% uptime
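The calculation above can be sketched in a few lines of Python. The function name `uptime_percentage` and its parameters are illustrative choices, not part of any particular monitoring tool:

```python
def uptime_percentage(total_minutes: float, downtime_minutes: float) -> float:
    """Return uptime as a percentage of the measurement window."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must be between 0 and total_minutes")
    return (total_minutes - downtime_minutes) / total_minutes * 100


# 30-day month: 30 days * 24 hours * 60 minutes = 43,200 minutes
minutes_in_month = 30 * 24 * 60
print(f"{uptime_percentage(minutes_in_month, 30):.2f}%")  # prints 99.93%
```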
The Nines
Industry shorthand describes reliability in "nines" (the monthly figures below assume an average month of about 30.44 days):
- Two nines (99%): 7.3 hours downtime/month
- Three nines (99.9%): 43.8 minutes downtime/month
- Four nines (99.99%): 4.4 minutes downtime/month
- Five nines (99.999%): 26.3 seconds downtime/month
Each additional nine cuts the downtime budget by a factor of ten and typically demands disproportionately more engineering effort. Most web services target three or four nines.
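The downtime figures in the list above can be reproduced with a short sketch. It assumes an average month of 365.25 / 12 days (about 43,830 minutes), which is the assumption that yields 43.8 minutes for three nines; the names `AVG_MONTH_MINUTES` and `downtime_budget_minutes` are illustrative:

```python
# Assumption: an "average" month of 365.25 / 12 days, about 43,830 minutes
AVG_MONTH_MINUTES = 365.25 / 12 * 24 * 60


def downtime_budget_minutes(target_percent: float,
                            period_minutes: float = AVG_MONTH_MINUTES) -> float:
    """Return the downtime allowed in the period for a given uptime target."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    return period_minutes * (1 - target_percent / 100)


for target in (99, 99.9, 99.99, 99.999):
    budget = downtime_budget_minutes(target)
    print(f"{target}% uptime allows about {budget:.2f} minutes of downtime per month")
```

Passing `period_minutes=43_200` instead gives the budget for a strict 30-day month, which is slightly smaller.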
Limitations
Uptime percentage treats all downtime equally. Five minutes down at 3 AM matters less than five minutes during peak shopping hours. Some organizations track business-hours uptime separately from overall uptime.
Response Time / Latency
How long does it take your service to respond to requests?
Measuring Response Time
Response time typically measures the duration from sending a request to receiving the complete response. Components include:
- DNS lookup time
- TCP connection establishment
- TLS handshake (for HTTPS)
- Time to first byte (server processing)
- Content transfer
Monitoring tools often break down total response time into these components, helping identify where slowness originates.
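As a rough illustration of how such a breakdown can be obtained, the sketch below times one HTTPS request with Python's standard `socket` and `ssl` modules. The function names, the `PHASES` table, and the choice to measure time to first byte from the end of the TLS handshake are assumptions made for this example; real monitoring tools may draw the phase boundaries differently.

```python
import socket
import ssl
import time

# (phase name, starting mark, ending mark); the names are illustrative
PHASES = [
    ("dns_lookup", "start", "dns"),
    ("tcp_connect", "dns", "tcp"),
    ("tls_handshake", "tcp", "tls"),
    ("time_to_first_byte", "tls", "ttfb"),
    ("content_transfer", "ttfb", "done"),
]


def phase_durations(marks: dict) -> dict:
    """Convert timestamp marks (in seconds) into per-phase durations (in ms)."""
    result = {name: (marks[end] - marks[begin]) * 1000 for name, begin, end in PHASES}
    result["total"] = (marks["done"] - marks["start"]) * 1000
    return result


def measure_phases(host: str, path: str = "/", port: int = 443,
                   timeout: float = 10.0) -> dict:
    """Time each phase of a single HTTPS GET request. Requires network access."""
    marks = {"start": time.perf_counter()}

    # DNS lookup: resolve the hostname and take the first result
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM)[0]
    marks["dns"] = time.perf_counter()

    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        sock.connect(sockaddr)  # TCP connection establishment
        marks["tcp"] = time.perf_counter()

        context = ssl.create_default_context()
        # wrap_socket performs the TLS handshake on the connected socket
        with context.wrap_socket(sock, server_hostname=host) as tls:
            marks["tls"] = time.perf_counter()

            request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
            tls.sendall(request.encode("ascii"))
            tls.recv(1)  # block until the first response byte arrives
            marks["ttfb"] = time.perf_counter()

            while tls.recv(65536):  # drain the rest of the response
                pass
            marks["done"] = time.perf_counter()

    return phase_durations(marks)
```

Calling `measure_phases("example.com")` returns a dictionary of millisecond durations, so a slow DNS resolver and a slow application server show up as different entries rather than one large total.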
Latency vs. Response Time
Technically, latency refers to delay in the network—the time for data to travel from point A to point B. Response time includes latency plus server processing time. In practice, the terms are often used interchangeably.
Why It Matters
Slow response times directly impact user experience. A commonly cited threshold is about 3 seconds, beyond which a significant share of users abandon the page. E-commerce conversion rates also tend to drop with each additional second of load time.
Error Rates
What percentage of requests result in errors?
Types of Errors
- 4xx errors: Client errors (bad requests, not found, unauthorized). Some level is normal.
- 5xx errors: Server errors (internal errors, bad gateway, service unavailable). These usually indicate real problems.
- Timeouts: Requests that never receive responses. Often worse than explicit errors.
Baseline vs. Anomaly
A 0.1% error rate might be perfectly normal for your application. A spike to 5% indicates a problem. Monitoring should track error rates over time to establish baselines and detect anomalies.
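A minimal sketch of baseline-versus-anomaly detection follows. It compares the current error rate to the mean of recent rates; the `multiplier` and `floor` parameters and their default values are hypothetical tuning knobs chosen for illustration, not recommended settings:

```python
def error_rate(errors: int, total: int) -> float:
    """Return the error rate as a percentage of all requests."""
    return errors / total * 100 if total else 0.0


def is_anomalous(current_rate: float, baseline_rates: list,
                 multiplier: float = 5.0, floor: float = 0.5) -> bool:
    """Flag the current rate if it far exceeds the recent baseline.

    multiplier and floor are illustrative values: the rate must exceed both
    `multiplier` times the baseline mean and an absolute floor (in percent),
    so tiny baselines do not trigger alerts on noise.
    """
    if not baseline_rates:
        raise ValueError("baseline_rates must not be empty")
    baseline = sum(baseline_rates) / len(baseline_rates)
    return current_rate > max(baseline * multiplier, floor)


recent = [0.1, 0.12, 0.08, 0.1]      # error rates (%) from earlier windows
print(is_anomalous(0.15, recent))    # False: within normal variation
print(is_anomalous(5.0, recent))     # True: a spike to 5%
```

Production systems often use more robust baselines (for example, the same hour on previous days), but the principle is the same: alert on deviation from normal, not on a fixed number alone.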
Availability vs. Reliability
These terms are related but distinct:
Availability
The proportion of time a service is operational. High availability means the service is almost always accessible.
Reliability
The probability that a service performs correctly when accessed. High reliability means requests succeed without errors.
A service could have 99.9% availability (rarely down) but poor reliability (frequent errors when up). Comprehensive monitoring tracks both dimensions.
SLA Metrics
Service Level Agreements (SLAs) formalize performance commitments. Common SLA metrics include:
Availability Commitment
"99.9% monthly uptime" commits to no more than 43.8 minutes of downtime per month. Monitoring verifies whether you meet this commitment.
Response Time Percentiles
"95th percentile response time under 500ms" means 95% of requests must complete within half a second. The worst 5% can exceed this threshold.
Error Rate Limits
"Error rate below 0.1%" sets a ceiling on acceptable failures.
Why SLAs Use Specific Metrics
SLAs define measurable, verifiable commitments. Vague promises like "fast and reliable" can't be objectively assessed. Specific metrics enable both parties to agree on whether commitments are met.
Why Averages Can Be Misleading
Average response time is a common but problematic metric.
The Problem
Consider two scenarios with identical 200ms average response times:
- Scenario A: All requests complete between 180-220ms
- Scenario B: 90% of requests complete in 100ms, 10% take 1100ms
The average masks dramatic differences in user experience. Scenario B has a severe problem affecting 10% of users.
Percentiles Tell the Real Story
Percentiles describe the distribution of values:
- p50 (median): Half of requests are faster, half are slower
- p90: 90% of requests are faster than this value
- p95: 95% of requests are faster than this value
- p99: 99% of requests are faster than this value
High percentiles reveal tail latency—the experience of your slowest requests. Users experiencing p99 latency have a significantly worse experience than the "average" suggests.
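To make this concrete, the sketch below builds the two scenarios described earlier (100 requests each) and compares their means and percentiles. It uses the nearest-rank method, which is one of several common percentile definitions; the `percentile` helper and the sample data are illustrative:

```python
import math
import statistics


def percentile(values: list, p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Scenario A: every request takes between 180 and 220 ms
scenario_a = [180, 190, 200, 210, 220] * 20
# Scenario B: 90% of requests take 100 ms, 10% take 1100 ms
scenario_b = [100] * 90 + [1100] * 10

for name, data in (("A", scenario_a), ("B", scenario_b)):
    print(name,
          "mean:", statistics.mean(data),
          "p50:", percentile(data, 50),
          "p95:", percentile(data, 95),
          "p99:", percentile(data, 99))
```

Both scenarios report a 200 ms mean, but the p95 is 220 ms for Scenario A and 1100 ms for Scenario B, which is exactly the difference the average hides.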
Choosing the Right Percentile
P50 indicates the typical experience. P95 or P99 indicates the worst experience that users still encounter regularly. For critical applications, track multiple percentiles to understand the full distribution.
Throughput
How many requests can your service handle over time?
Requests Per Second (RPS)
A common throughput measure for web services. The highest RPS a service can sustain while still meeting its latency and error targets indicates its capacity. Monitoring tracks whether actual traffic approaches that limit.
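One simple way to derive RPS from request logs is to bucket request timestamps by second, as sketched below. The function names and the `capacity_rps` parameter (a capacity figure you would obtain from load testing) are assumptions for this example:

```python
import math
from collections import Counter


def requests_per_second(timestamps: list) -> Counter:
    """Count requests in each one-second bucket (timestamps in seconds)."""
    return Counter(math.floor(ts) for ts in timestamps)


def peak_utilization(timestamps: list, capacity_rps: float) -> float:
    """Return peak observed RPS as a fraction of a known capacity limit."""
    if capacity_rps <= 0:
        raise ValueError("capacity_rps must be positive")
    per_second = requests_per_second(timestamps)
    return max(per_second.values(), default=0) / capacity_rps


# Hypothetical request arrival times, in seconds since the start of the window
arrivals = [0.1, 0.5, 0.9, 1.2, 1.3, 2.0]
print(requests_per_second(arrivals))             # three requests in second 0, two in second 1, one in second 2
print(peak_utilization(arrivals, capacity_rps=10))
```

A utilization value that regularly approaches 1.0 suggests traffic is nearing the capacity limit.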
Transactions Per Second (TPS)
For applications where a "transaction" involves multiple requests (e.g., a checkout flow), TPS provides a business-relevant measure.
Bandwidth
Data transferred per time unit. Important for media-heavy applications or services with large payloads.
Time-Based Patterns
Raw metrics gain meaning when analyzed over time:
Trend Analysis
Is response time gradually increasing? Are error rates creeping up? Trends indicate problems developing before they become critical.
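A gradual increase can be quantified by fitting a straight line to evenly spaced samples and reading off its slope. The sketch below uses ordinary least squares; the `trend_slope` name and the sample values are illustrative:

```python
def trend_slope(values: list) -> float:
    """Least-squares slope of evenly spaced samples, in units per sample interval."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance


# Hypothetical daily p95 response times in milliseconds
daily_p95_ms = [200, 210, 220, 230, 240]
print(trend_slope(daily_p95_ms))  # 10.0, i.e. p95 grows by about 10 ms per day
```

A persistently positive slope on response time or error rate is a signal worth investigating even while every individual sample is still below its alert threshold.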
Seasonality
Traffic patterns often follow predictable cycles—daily (busy during work hours), weekly (quiet on weekends), or seasonal (retail peaks in November). Understanding normal patterns helps identify anomalies.
Correlation
When metrics change together, they may share a cause. Response time increases during traffic spikes suggest capacity issues. Error rates rising after deployments indicate release problems.
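How strongly two metrics move together can be estimated with the Pearson correlation coefficient, sketched below on hypothetical per-minute samples. A high coefficient only suggests a shared cause; it does not prove one.

```python
import math


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    denominator = math.sqrt(spread_x * spread_y)
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant series")
    return numerator / denominator


# Hypothetical samples: requests per second and p95 response time (ms)
traffic_rps = [100, 150, 200, 400, 800]
p95_ms = [210, 220, 240, 420, 900]
print(f"correlation: {pearson(traffic_rps, p95_ms):.2f}")
```

A value close to 1 here would be consistent with response time degrading under load, which points toward a capacity issue.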
Monitoring vs. Observability
These terms are sometimes used interchangeably but represent different approaches:
Monitoring
Watching known metrics for known failure modes. You define what to measure and set alerts for concerning values. Monitoring answers: "Is the thing I'm watching behaving normally?"
Observability
The ability to understand system behavior from external outputs. Observability tools collect logs, metrics, and traces that help investigate unknown problems. Observability answers: "Why is the system behaving this way?"
Complementary Approaches
Monitoring detects problems; observability helps diagnose them. External uptime monitoring tells you when your site is down. Internal observability tools help you understand why.
Composite Health Scores
Some tools combine multiple metrics into single health scores:
Benefits
- Easier to communicate overall status to non-technical stakeholders
- Single metric for dashboards and reports
- Can trigger alerts when overall health degrades
Risks
- Obscures which underlying metric caused score changes
- Weighting between metrics involves subjective judgment
- May mask problems in one area when others perform well
Composite scores work well for executive summaries but should supplement, not replace, individual metric visibility.
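The masking risk is easy to demonstrate. The sketch below combines per-metric scores (each normalized to the range 0 to 1) with a weighted average; the metric names, scores, and weights are hypothetical, and choosing the weights is exactly the subjective judgment noted above:

```python
def composite_health(scores: dict, weights: dict) -> float:
    """Weighted average of per-metric scores, each normalized to 0..1."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


def weakest_metric(scores: dict) -> str:
    """Name of the lowest-scoring metric, reported alongside the composite."""
    return min(scores, key=scores.get)


# Hypothetical normalized scores and weights
scores = {"uptime": 1.0, "latency": 1.0, "errors": 0.2}
weights = {"uptime": 0.4, "latency": 0.4, "errors": 0.2}

overall = composite_health(scores, weights)
print(f"overall health: {overall:.2f}, weakest metric: {weakest_metric(scores)}")
```

The overall score comes out at about 0.84 even though the error metric is in poor shape, which is why reporting the weakest component next to the composite is a sensible safeguard.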
Choosing What to Track
You can't monitor everything. Prioritize metrics that:
- Directly impact user experience
- Indicate potential failures before they become critical
- Support SLA compliance verification
- Inform capacity planning decisions
Start with core availability metrics (uptime, response time, errors). Add detailed metrics as you understand your system's specific failure modes and performance characteristics.
Quick Reference: The 7 Core Metrics
| Metric | What It Measures | Good Target | Warning Sign |
|---|---|---|---|
| Uptime % | % of time service is accessible | ≥ 99.9% | < 99.5% |
| Response Time (p95) | Time to receive complete response (95th pct) | < 500ms | > 2,000ms |
| Error Rate | % of requests returning 5xx/timeouts | < 0.1% | > 1% |
| TTFB | Time to first byte (server processing) | < 200ms | > 800ms |
| Throughput (RPS) | Requests handled per second | Stable under load | Drops during traffic spikes |
| MTTR | Mean time to recover from incidents | < 15 min | > 60 min |
| SLA Compliance | % of time contracted uptime is met | 100% | Any breach |
Frequently Asked Questions
What is the difference between uptime and availability?
Uptime refers to the raw time a server or service is powered on and running. Availability is a broader measure that accounts for whether the service is actually usable — including response time and error rate. A server can be "up" while returning 503 errors to every user, making it unavailable in practice. Availability monitoring checks from the user's perspective, not just the server's perspective.
What is a good uptime percentage for a website?
99.9% ("three nines") is the standard baseline for most production web services, allowing up to 43.8 minutes of downtime per month. E-commerce stores and SaaS platforms typically target 99.99% (four nines, ~4.4 minutes/month). Achieving five nines (99.999%, ~26 seconds/month) requires significant infrastructure investment and is typically reserved for mission-critical systems.
What does p99 response time mean?
P99 (99th percentile) response time means that 99% of all requests completed faster than this value; only 1% of requests were slower. If your p99 response time is 2,000ms, then 1 in 100 requests takes longer than 2 seconds to complete. Monitoring percentiles (p50, p95, p99) reveals tail latency that averages hide, giving a realistic picture of your slowest user experiences.
How is uptime percentage calculated?
Uptime % = (Total time in period − Total downtime) ÷ Total time × 100. For example, if your site was down for 20 minutes in a 30-day month (43,200 minutes): (43,200 − 20) ÷ 43,200 × 100 = 99.95%. Monitoring tools calculate this automatically based on check results, giving you monthly and yearly uptime reports without manual tracking.
What is MTTR in monitoring?
MTTR (Mean Time to Recover) is the average time from when an incident begins to when the service is fully restored. It combines detection time (how quickly your monitoring alerts you) and resolution time (how quickly your team fixes the issue). Reducing MTTR is one of the highest-ROI reliability investments: faster alerting directly reduces MTTR even without changing your code.
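As a minimal sketch, MTTR can be computed from a list of incident records. Each record here is a hypothetical `(started, restored)` pair of timestamps, where `started` is whichever point your definition starts the clock from:

```python
from datetime import datetime, timedelta


def mean_time_to_recover(incidents: list) -> timedelta:
    """Average duration of incidents given as (started, restored) datetime pairs."""
    if not incidents:
        raise ValueError("no incidents to average")
    total = sum((restored - started for started, restored in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident log: one 10-minute and one 20-minute outage
incident_log = [
    (datetime(2026, 1, 5, 9, 0), datetime(2026, 1, 5, 9, 10)),
    (datetime(2026, 1, 19, 14, 30), datetime(2026, 1, 19, 14, 50)),
]
print(mean_time_to_recover(incident_log))  # 0:15:00
```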
What is the difference between monitoring and observability?
Monitoring watches predefined metrics for known failure modes and alerts when thresholds are exceeded — it answers "is something wrong?" Observability is the ability to understand system behavior from its outputs (logs, metrics, traces) when something unexpected happens — it answers "why is it wrong?" External uptime monitoring is the simplest form of monitoring. Observability platforms like Datadog or Grafana provide deeper internal visibility. Both are complementary: monitoring catches problems, observability helps diagnose them.
How often should performance metrics be collected?
External uptime monitoring typically checks every 1–5 minutes, which is sufficient to detect outages. For performance metrics (response time, error rate, throughput), collection every 30–60 seconds provides good resolution for dashboards. For real-time anomaly detection in high-traffic systems, 10–15 second intervals or even per-request sampling through APM tools may be needed. More frequent collection generates more data — balance granularity against storage and analysis costs.
About the Author
DevOps Team
DevOps Engineers
Our DevOps team brings decades of experience in building and maintaining reliable infrastructure.