
7 Key Metrics Tracked by Performance and Availability Monitoring Tools (2026)

DevOps Team

DevOps Engineers

January 30, 2026

Beyond Up or Down

Availability might seem binary—your service is either accessible or it isn't. In practice, availability exists on a spectrum. A website that technically responds but takes 30 seconds to load isn't meaningfully "up" for most purposes. Performance and availability tools track multiple metrics to provide a complete picture of service health.

Uptime Percentage

The most fundamental availability metric: what portion of time was the service accessible?

Calculation

Uptime percentage = (Total time - Downtime) / Total time × 100

If a service experienced 30 minutes of downtime over a 30-day month (43,200 minutes total):

(43,200 - 30) / 43,200 × 100 = 99.93% uptime
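The same calculation as a minimal Python sketch. The function name `uptime_percentage` and its parameters are illustrative, not taken from any particular monitoring tool.

```python
def uptime_percentage(total_minutes: float, downtime_minutes: float) -> float:
    """Return uptime as a percentage of the measurement period."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must be between 0 and total_minutes")
    return (total_minutes - downtime_minutes) / total_minutes * 100


# The example above: 30 minutes of downtime in a 30-day month.
minutes_in_month = 30 * 24 * 60  # 43,200
print(f"{uptime_percentage(minutes_in_month, 30):.2f}%")  # prints 99.93%
```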

The Nines

Industry shorthand describes reliability in "nines". The monthly figures below assume an average month of about 30.4 days:

Two nines (99%): 7.3 hours downtime/month
Three nines (99.9%): 43.8 minutes downtime/month
Four nines (99.99%): 4.4 minutes downtime/month
Five nines (99.999%): 26.3 seconds downtime/month

Each additional nine cuts the downtime budget by a factor of ten and typically demands disproportionately more engineering effort. Most web services target three or four nines.
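The downtime figures in the list of nines can be reproduced with a few lines of Python. This is a sketch under one assumption: a "month" is one twelfth of a 365.25-day year. With a strict 30-day month the numbers come out slightly lower (43.2 minutes for three nines, for instance). The names are illustrative.

```python
# Assumption: an average month is 1/12 of a 365.25-day year (43,830 minutes).
AVERAGE_MONTH_MINUTES = 365.25 * 24 * 60 / 12


def downtime_budget_minutes(availability_pct: float,
                            period_minutes: float = AVERAGE_MONTH_MINUTES) -> float:
    """Maximum downtime (in minutes) that still meets the availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_minutes * (1 - availability_pct / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    budget = downtime_budget_minutes(target)
    print(f"{target}% -> {budget:.2f} minutes/month ({budget * 60:.1f} seconds)")
```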

Limitations

Uptime percentage treats all downtime equally. Five minutes down at 3 AM matters less than five minutes during peak shopping hours. Some organizations track business-hours uptime separately from overall uptime.
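One way to track business-hours uptime separately is to filter check results by timestamp before computing the percentage. The sketch below assumes business hours mean Monday to Friday, 09:00 to 17:00, and uses made-up hourly check data; both are assumptions for illustration.

```python
from datetime import datetime, timedelta


def is_business_hours(ts: datetime, start_hour: int = 9, end_hour: int = 17) -> bool:
    """Assumed definition: Monday to Friday, 09:00 up to (not including) 17:00."""
    return ts.weekday() < 5 and start_hour <= ts.hour < end_hour


def uptime_pct(checks: list) -> float:
    """checks is a list of (timestamp, succeeded) pairs."""
    if not checks:
        raise ValueError("no checks in the selected window")
    return sum(1 for _, ok in checks if ok) / len(checks) * 100


# Hypothetical data: one check per hour on Monday 2026-01-05, failing only at 14:00.
day_start = datetime(2026, 1, 5)
checks = [(day_start + timedelta(hours=h), h != 14) for h in range(24)]
business_checks = [c for c in checks if is_business_hours(c[0])]

print(f"overall:        {uptime_pct(checks):.2f}%")
print(f"business hours: {uptime_pct(business_checks):.2f}%")
```

The single failed check costs about four percentage points overall but more than twelve within business hours, which is the distinction the separate metric is meant to surface.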

Response Time / Latency

How long does it take your service to respond to requests?

Measuring Response Time

Response time typically measures the duration from sending a request to receiving the complete response. Components include:

DNS lookup time
TCP connection establishment
TLS handshake (for HTTPS)
Time to first byte (server processing)
Content transfer

Monitoring tools often break down total response time into these components, helping identify where slowness originates.
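As a rough sketch of that breakdown: many HTTP clients report cumulative timestamps for each phase (curl's write-out timers such as `time_namelookup` and `time_starttransfer` work this way). Subtracting consecutive marks gives per-phase durations. The class, field names, and sample values below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CumulativeTimings:
    """Seconds elapsed since the request started, at the end of each phase."""
    dns_done: float
    tcp_connected: float
    tls_done: float
    first_byte: float
    complete: float


def phase_durations_ms(t: CumulativeTimings) -> dict:
    """Convert cumulative marks into the duration of each individual phase."""
    marks = [
        ("dns_lookup", t.dns_done),
        ("tcp_connect", t.tcp_connected),
        ("tls_handshake", t.tls_done),
        ("wait_first_byte", t.first_byte),
        ("content_transfer", t.complete),
    ]
    durations = {}
    previous = 0.0
    for name, mark in marks:
        durations[name] = round((mark - previous) * 1000, 1)
        previous = mark
    return durations


# Hypothetical measurement of one request (values in seconds).
sample = CumulativeTimings(0.020, 0.045, 0.110, 0.310, 0.350)
durations = phase_durations_ms(sample)
slowest_phase = max(durations, key=durations.get)
print(durations)
print("slowest phase:", slowest_phase)
```

In this made-up sample the wait for the first byte dominates, which would point toward server processing rather than the network.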

Latency vs. Response Time

Technically, latency refers to delay in the network—the time for data to travel from point A to point B. Response time includes latency plus server processing time. In practice, the terms are often used interchangeably.

Why It Matters

Slow response times directly hurt user experience. Industry studies commonly associate load times above roughly 3 seconds with significant user abandonment, and e-commerce conversion rates tend to fall as load time increases.

Error Rates

What percentage of requests result in errors?

Types of Errors

4xx errors: Client errors (bad requests, not found, unauthorized). Some level is normal.
5xx errors: Server errors (internal errors, bad gateway, service unavailable). These usually indicate real problems.
Timeouts: Requests that never receive responses. Often worse than explicit errors.

Baseline vs. Anomaly

A 0.1% error rate might be perfectly normal for your application. A spike to 5% indicates a problem. Monitoring should track error rates over time to establish baselines and detect anomalies.
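A minimal sketch of comparing a current error rate against a baseline. It counts 5xx responses and timeouts as failures and leaves 4xx responses out. The anomaly rule (five times the baseline, with a 1% floor) is an arbitrary assumption for illustration, not a recommended threshold.

```python
from typing import Optional, Sequence


def error_rate_pct(status_codes: Sequence[Optional[int]]) -> float:
    """Percentage of requests that failed. None stands for a timeout."""
    if not status_codes:
        return 0.0
    failures = sum(1 for code in status_codes if code is None or code >= 500)
    return failures / len(status_codes) * 100


def is_anomalous(current_pct: float, baseline_pct: float,
                 factor: float = 5.0, floor_pct: float = 1.0) -> bool:
    """Assumed rule: flag when the rate reaches max(baseline * factor, floor)."""
    return current_pct >= max(baseline_pct * factor, floor_pct)


# Hypothetical windows of 1,000 requests each.
baseline_window = [200] * 989 + [404] * 10 + [500]
spike_window = [200] * 950 + [503] * 30 + [None] * 20

baseline = error_rate_pct(baseline_window)
current = error_rate_pct(spike_window)
print(f"baseline {baseline:.1f}%, current {current:.1f}%, "
      f"anomalous: {is_anomalous(current, baseline)}")
```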

Availability vs. Reliability

These terms are related but distinct:

Availability

The proportion of time a service is operational. High availability means the service is almost always accessible.

Reliability

The probability that a service performs correctly when accessed. High reliability means requests succeed without errors.

A service could have 99.9% availability (rarely down) but poor reliability (frequent errors when up). Comprehensive monitoring tracks both dimensions.

SLA Metrics

Service Level Agreements (SLAs) formalize performance commitments. Common SLA metrics include:

Availability Commitment

"99.9% monthly uptime" commits to no more than 43.8 minutes of downtime per month. Monitoring verifies whether you meet this commitment.

Response Time Percentiles

"95th percentile response time under 500ms" means 95% of requests must complete within half a second. The worst 5% can exceed this threshold.

Error Rate Limits

"Error rate below 0.1%" sets a ceiling on acceptable failures.

Why SLAs Use Specific Metrics

SLAs define measurable, verifiable commitments. Vague promises like "fast and reliable" can't be objectively assessed. Specific metrics enable both parties to agree on whether commitments are met.
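Because the commitments are numeric, checking them can be automated. The sketch below uses the three example commitments from this section as default targets. The class and function names are hypothetical, and the measured values are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaTargets:
    min_uptime_pct: float = 99.9       # "99.9% monthly uptime"
    max_p95_ms: float = 500.0          # "95th percentile response time under 500ms"
    max_error_rate_pct: float = 0.1    # "Error rate below 0.1%"


@dataclass(frozen=True)
class MonthlyMeasurements:
    uptime_pct: float
    p95_ms: float
    error_rate_pct: float


def sla_breaches(measured: MonthlyMeasurements,
                 targets: SlaTargets = SlaTargets()) -> list:
    """Return a human-readable line for every commitment that was missed."""
    breaches = []
    if measured.uptime_pct < targets.min_uptime_pct:
        breaches.append(f"uptime {measured.uptime_pct}% is below {targets.min_uptime_pct}%")
    if measured.p95_ms >= targets.max_p95_ms:
        breaches.append(f"p95 {measured.p95_ms}ms is not under {targets.max_p95_ms}ms")
    if measured.error_rate_pct >= targets.max_error_rate_pct:
        breaches.append(
            f"error rate {measured.error_rate_pct}% is not below {targets.max_error_rate_pct}%")
    return breaches


good_month = MonthlyMeasurements(uptime_pct=99.95, p95_ms=420, error_rate_pct=0.05)
bad_month = MonthlyMeasurements(uptime_pct=99.85, p95_ms=620, error_rate_pct=0.05)
print(sla_breaches(good_month))
print(sla_breaches(bad_month))
```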

Why Averages Can Be Misleading

Average response time is a common but problematic metric.

The Problem

Consider two scenarios with identical 200ms average response times:

Scenario A: All requests complete between 180-220ms
Scenario B: 90% of requests complete in 100ms, 10% take 1100ms

The average masks dramatic differences in user experience. Scenario B has a severe problem affecting 10% of users.
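The two scenarios are easy to reproduce. The sample lists below are constructed to match the description (100 requests each); they are illustrative data, not measurements.

```python
from statistics import mean, median

# Scenario A: 100 requests spread evenly between 180ms and 220ms.
scenario_a = [180, 190, 200, 210, 220] * 20
# Scenario B: 90 requests at 100ms and 10 requests at 1100ms.
scenario_b = [100] * 90 + [1100] * 10

for name, samples in (("A", scenario_a), ("B", scenario_b)):
    print(f"Scenario {name}: mean={mean(samples)}ms "
          f"median={median(samples)}ms max={max(samples)}ms")
```

Both means come out at 200ms, while the median and the maximum differ sharply between the two scenarios.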

Percentiles Tell the Real Story

Percentiles describe the distribution of values:

p50 (median): Half of requests are faster, half are slower
p90: 90% of requests are faster than this value
p95: 95% of requests are faster than this value
p99: 99% of requests are faster than this value

High percentiles reveal tail latency—the experience of your slowest requests. Users experiencing p99 latency have a significantly worse experience than the "average" suggests.
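A minimal sketch of computing percentiles with the nearest-rank method. This is one of several common definitions; monitoring tools may interpolate between samples or use approximate histograms, so their numbers can differ slightly. The sample data mirrors Scenario B above.

```python
import math


def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# 90 requests at 100ms and 10 requests at 1100ms (mean: 200ms).
latencies_ms = [100] * 90 + [1100] * 10

for pct in (50, 90, 95, 99):
    print(f"p{pct}: {percentile(latencies_ms, pct)}ms")
```

For this distribution p50 and p90 both report 100ms, and only p95 and p99 expose the 1100ms tail. That is why tracking a single percentile can still miss a problem.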

Choosing the Right Percentile

The p50 value indicates the typical experience. The p95 or p99 value indicates the worst experience that users still encounter regularly. For critical applications, track multiple percentiles to understand the full distribution.

Throughput

How many requests can your service handle over time?

Requests Per Second (RPS)

A common throughput measure for web services. The highest RPS a service can sustain without rising latency or errors indicates its capacity. Monitoring tracks whether actual traffic approaches that limit.
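A rough sketch of deriving RPS from request arrival timestamps and comparing the peak against a known capacity. The capacity figure and sample timestamps are hypothetical; in practice the capacity would come from load testing.

```python
from collections import Counter


def requests_per_second(timestamps: list) -> Counter:
    """Count requests per whole second; timestamps are non-negative seconds."""
    return Counter(int(ts) for ts in timestamps)


def peak_and_average_rps(timestamps: list) -> tuple:
    """Return (peak RPS, average RPS) over the observed span."""
    buckets = requests_per_second(timestamps)
    if not buckets:
        return 0, 0.0
    # Include idle seconds between the first and last request in the span.
    span_seconds = max(buckets) - min(buckets) + 1
    return max(buckets.values()), sum(buckets.values()) / span_seconds


def utilization_pct(peak_rps: float, capacity_rps: float) -> float:
    """How close the observed peak is to the assumed capacity limit."""
    if capacity_rps <= 0:
        raise ValueError("capacity_rps must be positive")
    return peak_rps / capacity_rps * 100


arrivals = [0.1, 0.5, 0.9, 1.2, 3.4, 3.5]  # hypothetical arrival times in seconds
peak, average = peak_and_average_rps(arrivals)
print(f"peak={peak} rps, average={average} rps, "
      f"utilization={utilization_pct(peak, capacity_rps=4):.0f}%")
```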

Transactions Per Second (TPS)

For applications where a "transaction" involves multiple requests (e.g., a checkout flow), TPS provides a business-relevant measure.

Bandwidth

Data transferred per time unit. Important for media-heavy applications or services with large payloads.

Time-Based Patterns

Raw metrics gain meaning when analyzed over time:

Trend Analysis

Is response time gradually increasing? Are error rates creeping up? Trends indicate problems developing before they become critical.
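One simple way to quantify a trend is the least-squares slope of a metric over time. The sketch below fits a line through equally spaced samples, for example one p95 value per day. The sample series is invented, and real tools often use more robust methods.

```python
def trend_slope(values: list) -> float:
    """Least-squares slope of values against their index (units per sample)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


# Hypothetical daily p95 response times in milliseconds.
daily_p95_ms = [210, 214, 219, 225, 228, 236, 241]
slope = trend_slope(daily_p95_ms)
print(f"p95 is drifting by about {slope:.1f} ms per day")
```

A slope that stays positive over several weeks is the kind of gradual increase that no single day's value would flag.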

Seasonality

Traffic patterns often follow predictable cycles—daily (busy during work hours), weekly (quiet on weekends), or seasonal (retail peaks in November). Understanding normal patterns helps identify anomalies.

Correlation

When metrics change together, they may share a cause. Response time increases during traffic spikes suggest capacity issues. Error rates rising after deployments indicate release problems.
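A minimal sketch of measuring how closely two metrics move together, using the Pearson correlation coefficient. The traffic and latency series are made up. A high coefficient only suggests a shared cause; it does not prove one.

```python
import math


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical samples taken at the same five points in time.
traffic_rps = [100, 150, 200, 400, 800]
p95_ms = [180, 185, 190, 260, 520]

r = pearson(traffic_rps, p95_ms)
print(f"correlation between traffic and p95 latency: {r:.2f}")
```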

Monitoring vs. Observability

These terms are sometimes used interchangeably but represent different approaches:

Monitoring

Watching known metrics for known failure modes. You define what to measure and set alerts for concerning values. Monitoring answers: "Is the thing I'm watching behaving normally?"

Observability

The ability to understand system behavior from external outputs. Observability tools collect logs, metrics, and traces that help investigate unknown problems. Observability answers: "Why is the system behaving this way?"

Complementary Approaches

Monitoring detects problems; observability helps diagnose them. External uptime monitoring tells you when your site is down. Internal observability tools help you understand why.

Composite Health Scores

Some tools combine multiple metrics into single health scores:

Benefits

Easier to communicate overall status to non-technical stakeholders
Single metric for dashboards and reports
Can trigger alerts when overall health degrades

Risks

Obscures which underlying metric caused score changes
Weighting between metrics involves subjective judgment
May mask problems in one area when others perform well

Composite scores work well for executive summaries but should supplement, not replace, individual metric visibility.
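A sketch of how such a score might be built, and how it can hide a failing component. It assumes each metric has already been normalised to a 0-100 score; the weights are arbitrary choices, which is exactly the subjectivity listed under Risks.

```python
def health_score(component_scores: dict, weights: dict) -> float:
    """Weighted average of per-metric scores, each on a 0-100 scale."""
    if set(component_scores) != set(weights):
        raise ValueError("scores and weights must cover the same metrics")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(component_scores[name] * weights[name] for name in weights)
    return weighted / total_weight


# Hypothetical normalised scores: errors are in bad shape, the rest look fine.
scores = {"uptime": 100, "latency": 95, "errors": 20}
weights = {"uptime": 0.5, "latency": 0.3, "errors": 0.2}

overall = health_score(scores, weights)
worst = min(scores, key=scores.get)
print(f"overall health: {overall:.1f}, weakest component: {worst} ({scores[worst]})")
```

An overall score above 80 looks healthy even though the error component sits at 20, so reporting the weakest component alongside the composite keeps the underlying problem visible.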

Choosing What to Track

You can't monitor everything. Prioritize metrics that:

Directly impact user experience
Indicate potential failures before they become critical
Support SLA compliance verification
Inform capacity planning decisions

Start with core availability metrics (uptime, response time, errors). Add detailed metrics as you understand your system's specific failure modes and performance characteristics.

Quick Reference: The 7 Core Metrics

Metric	What It Measures	Good Target	Warning Sign
Uptime %	% of time service is accessible	≥ 99.9%	< 99.5%
Response Time (p95)	Time to receive complete response (95th pct)	< 500ms	> 2,000ms
Error Rate	% of requests returning 5xx/timeouts	< 0.1%	> 1%
TTFB	Time to first byte (server processing)	< 200ms	> 800ms
Throughput (RPS)	Requests handled per second	Stable under load	Drops during traffic spikes
MTTR	Mean time to recover from incidents	< 15 min	> 60 min
SLA Compliance	% of time contracted uptime is met	100%	Any breach

Frequently Asked Questions

What is the difference between uptime and availability?

Uptime refers to the raw time a server or service is powered on and running. Availability is a broader measure that accounts for whether the service is actually usable — including response time and error rate. A server can be "up" while returning 503 errors to every user, making it unavailable in practice. Availability monitoring checks from the user's perspective, not just the server's perspective.

What is a good uptime percentage for a website?

99.9% ("three nines") is the standard baseline for most production web services, allowing up to 43.8 minutes of downtime per month. E-commerce stores and SaaS platforms typically target 99.99% (four nines, ~4.4 minutes/month). Achieving five nines (99.999%, ~26 seconds/month) requires significant infrastructure investment and is typically reserved for mission-critical systems.

What does p99 response time mean?

P99 (99th percentile) response time means that 99% of all requests completed faster than this value; only 1% were slower. If your p99 response time is 2,000ms, 1 in 100 requests takes longer than 2 seconds. Monitoring percentiles (p50, p95, p99) reveals tail latency that averages hide, giving a realistic picture of your slowest user experiences.

How is uptime percentage calculated?

Uptime % = (Total time in period − Total downtime) ÷ Total time × 100. For example, if your site was down for 20 minutes in a 30-day month (43,200 minutes): (43,200 − 20) ÷ 43,200 × 100 = 99.95%. Monitoring tools calculate this automatically based on check results, giving you monthly and yearly uptime reports without manual tracking.

What is MTTR in monitoring?

MTTR (Mean Time to Recover) is the average time from when an incident begins to when the service is fully restored. It combines detection time (how quickly your monitoring alerts you) and resolution time (how quickly your team fixes the issue). Reducing MTTR is one of the highest-ROI reliability investments: faster alerting directly reduces MTTR even without changing your code.
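A minimal sketch of the calculation, assuming each incident is recorded as a pair of timestamps. Which moment counts as the start of an incident is a choice your team has to make; the incident data below is invented.

```python
from datetime import datetime, timedelta


def mean_time_to_recover(incidents: list) -> timedelta:
    """Average of (restored - started) over a list of (started, restored) pairs."""
    if not incidents:
        raise ValueError("no incidents recorded")
    total = sum((restored - started for started, restored in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incidents lasting 12, 30, and 18 minutes.
incidents = [
    (datetime(2026, 1, 3, 2, 10), datetime(2026, 1, 3, 2, 22)),
    (datetime(2026, 1, 11, 14, 0), datetime(2026, 1, 11, 14, 30)),
    (datetime(2026, 1, 24, 9, 45), datetime(2026, 1, 24, 10, 3)),
]

mttr = mean_time_to_recover(incidents)
print(f"MTTR: {mttr.total_seconds() / 60:.0f} minutes")  # prints MTTR: 20 minutes
```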

What is the difference between monitoring and observability?

Monitoring watches predefined metrics for known failure modes and alerts when thresholds are exceeded — it answers "is something wrong?" Observability is the ability to understand system behavior from its outputs (logs, metrics, traces) when something unexpected happens — it answers "why is it wrong?" External uptime monitoring is the simplest form of monitoring. Observability platforms like Datadog or Grafana provide deeper internal visibility. Both are complementary: monitoring catches problems, observability helps diagnose them.

How often should performance metrics be collected?

External uptime monitoring typically checks every 1–5 minutes, which is sufficient to detect outages. For performance metrics (response time, error rate, throughput), collection every 30–60 seconds provides good resolution for dashboards. For real-time anomaly detection in high-traffic systems, 10–15 second intervals or even per-request sampling through APM tools may be needed. More frequent collection generates more data — balance granularity against storage and analysis costs.

