Introduction to Observability

Updated Mar 28, 2023

Overview

Observability is the practice of understanding and measuring the internal state of a system using the data it generates.

  • Provides actionable insights for unexpected scenarios.
  • Improves visibility into application behavior.
  • Speeds up troubleshooting, detects problems, and monitors performance.

When troubleshooting issues, understanding the root cause is essential. Observability helps identify:

  • Why errors are increasing.
  • Why latency is high.
  • Why services are timing out.

It provides the context needed to answer "why" questions and mitigate future occurrences.

Telemetry

Telemetry is the practice of collecting and analyzing data to monitor system performance. The main types of telemetry data are:

  • Metrics

    • Quantifiable measurements like CPU or memory usage.
    • Show trends and system health.
    • Collected at regular intervals.
  • Events

    • Discrete occurrences like logins or crashes.
    • Capture specific actions or anomalies.
    • Used for audits and troubleshooting.
  • Logs

    • Time-stamped records of activities (e.g., errors).
    • Provide detailed insights for debugging.
    • Useful for root cause analysis.
  • Traces

    • Follows requests across distributed systems.
    • Identifies bottlenecks and delays.
    • Helps optimize performance.
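
As a rough sketch, each telemetry type can be pictured as a simple record. The field names below are illustrative rather than tied to any particular tool:

    # Illustrative shapes of the four telemetry types (field names are hypothetical).
    metric = {"name": "cpu_usage_percent", "value": 72.5, "timestamp": 1680000000}
    event = {"type": "user_login", "user": "alice", "timestamp": 1680000012}
    log = {"timestamp": 1680000013, "level": "ERROR", "message": "payment failed: timeout"}
    span = {"trace_id": "abc123", "span_id": "s1", "parent_id": None,
            "name": "GET /checkout", "start_time": 1680000014.0, "duration_ms": 182}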

Pillars of Observability

Logging

Logs capture detailed records of system events.

  • Textual data about actions or errors.
  • Supports debugging and specific incident analysis.
  • Provides historical context for root cause analysis.

While logs are essential, their verbosity can make it challenging to extract meaningful information.
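
For example, here is a minimal sketch using Python's standard logging module; the logger name, message, and error are illustrative:

    import logging

    # Timestamped format so each record carries enough context for later analysis.
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )
    logger = logging.getLogger("checkout")

    try:
        raise TimeoutError("payment gateway did not respond")
    except TimeoutError:
        # exc_info=True attaches the stack trace, supporting root cause analysis.
        logger.error("payment failed for order_id=42", exc_info=True)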

Metrics

Metrics measure numerical data to monitor system health and trends.

  • Aggregate data like CPU usage, memory, and latency.
  • Enable trend analysis and alerting on thresholds.
  • Offer a high-level overview for capacity planning.

The data collected can be aggregated and graphed with visualization tools to identify trends over time. Metrics can include:

  • CPU load
  • Number of open files
  • HTTP response times
  • Number of errors

In general, a metric has these main components:

  • Metric name
  • Value
  • Timestamp for the metric
  • Dimensions
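
Those four components can be sketched as a small data structure; the class and field names are hypothetical:

    import time
    from dataclasses import dataclass, field

    @dataclass
    class MetricSample:
        name: str                                            # Metric name
        value: float                                         # Measured value
        timestamp: float = field(default_factory=time.time)  # When it was taken
        dimensions: dict = field(default_factory=dict)       # Labels, e.g. {"region": "eu"}

    sample = MetricSample("http_response_time_ms", 87.0, dimensions={"path": "/checkout"})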

Tracing

Tracing tracks the flow of requests across distributed systems.

  • Maps relationships and dependencies between services.
  • Pinpoints bottlenecks or failures in multi-service environments.
  • Offers insights into end-to-end system performance.

Each trace has a trace-id that identifies the request as it traverses the system. The individual events forming a trace are called spans, and each span tracks the following:

  • Start time
  • Duration
  • Parent-id
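
The following sketch shows how spans sharing a trace-id form a parent-child tree; the helper functions and names are hypothetical:

    import time
    import uuid

    def new_span(trace_id, name, parent_id=None):
        # Each span records its start time; duration is filled in when it ends.
        return {"trace_id": trace_id, "span_id": uuid.uuid4().hex[:8],
                "parent_id": parent_id, "name": name,
                "start_time": time.time(), "duration": None}

    def end_span(span):
        span["duration"] = time.time() - span["start_time"]

    trace_id = uuid.uuid4().hex  # Shared by every span in the request
    root = new_span(trace_id, "GET /checkout")
    child = new_span(trace_id, "payment-service.charge", parent_id=root["span_id"])
    end_span(child)
    end_span(root)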

Methods of Monitoring

Microservices-based applications are commonly divided into three main layers, each requiring a specific monitoring approach. Drawing on Google's SRE book and the wider monitoring community, the layers and methods are:

Layer                | Description                                                       | Monitoring Method
UI Layer             | Website and applications for user interaction.                   | Core Web Vitals
Service Layer        | Microservices like payment, booking, and communication services. | RED Method
Infrastructure Layer | Physical or virtual resources such as memory, disk, and CPU.     | USE Method

Google also introduced the Four Golden Signals, which cover metrics for both the service and infrastructure layers to evaluate performance and reliability.

RED Method

The RED Method is request-oriented, focusing on how well individual requests are handled.

Metric   | Description
Rate     | The number of requests per second received by the service.
Errors   | The number of failed requests or error rates in processing requests.
Duration | The time taken to serve a request, including latency or processing delays.
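
As an illustration, the three RED metrics can be computed from request records collected over a window; the record format below is assumed:

    # Assumed request records collected over a 60-second window.
    requests = [
        {"status": 200, "duration_ms": 45},
        {"status": 500, "duration_ms": 310},
        {"status": 200, "duration_ms": 88},
    ]
    window_seconds = 60

    rate = len(requests) / window_seconds                    # Rate: requests per second
    errors = sum(1 for r in requests if r["status"] >= 500)  # Errors: failed requests
    avg_duration = sum(r["duration_ms"] for r in requests) / len(requests)  # Duration

    print(f"rate={rate:.2f} req/s, errors={errors}, avg duration={avg_duration:.0f} ms")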

USE Method

The USE Method is resource-oriented, helping monitor the health of system resources like servers or containers.

Metric      | Description
Utilization | Measures the percentage of resource capacity in use (e.g., CPU at 70% utilization).
Saturation  | Indicates the extent of overloading or nearing capacity (e.g., high disk queue).
Errors      | Tracks hardware or resource-level errors (e.g., disk I/O errors, failed connections).
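
A minimal sketch of the USE method for the CPU and network, assuming the third-party psutil library is installed:

    import psutil

    # Utilization: percentage of CPU capacity in use over a 1-second sample.
    utilization = psutil.cpu_percent(interval=1)

    # Saturation: 1-minute load average relative to core count;
    # a ratio above 1.0 means work is queuing for the CPU.
    load_1min, _, _ = psutil.getloadavg()
    saturation = load_1min / psutil.cpu_count()

    # Errors: interface-level receive/transmit errors as a resource error signal.
    net = psutil.net_io_counters()
    errors = net.errin + net.errout

    print(f"utilization={utilization}%, saturation={saturation:.2f}, errors={errors}")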

Four Golden Signals

The Four Golden Signals provide a comprehensive view of system performance and reliability.

Metric     | Description
Latency    | The time taken for a request to be completed, including successful and failed requests.
Traffic    | Represents demand on the system, such as requests per second or data throughput.
Errors     | The percentage or count of failed requests across the system.
Saturation | The system's load relative to its maximum capacity.
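
A sketch of a simple alert check across the four signals; the measurements and thresholds are illustrative, not recommendations:

    # Assumed current measurements and illustrative alert thresholds.
    signals = {"latency_ms": 220, "traffic_rps": 950, "error_rate": 0.02, "saturation": 0.85}
    thresholds = {"latency_ms": 300, "traffic_rps": 1000, "error_rate": 0.01, "saturation": 0.80}

    for name, value in signals.items():
        if value > thresholds[name]:
            print(f"ALERT: {name}={value} exceeds threshold {thresholds[name]}")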

Core Web Vitals

Core Web Vitals are user experience metrics that measure website performance, particularly for the UI layer.

Metric                   | Description
Largest Contentful Paint | Measures the loading performance by tracking when the largest visible element appears.
First Input Delay        | Tracks interactivity by measuring the delay in responding to the first user input.
Cumulative Layout Shift  | Evaluates visual stability by measuring unexpected layout shifts during loading.
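
Google publishes "good" thresholds for each vital (LCP ≤ 2.5 s, FID ≤ 100 ms, CLS ≤ 0.1). The sketch below classifies assumed measurements against them:

    # Google's published "good" thresholds for each Core Web Vital.
    good_thresholds = {"lcp_s": 2.5, "fid_ms": 100, "cls": 0.1}

    # Assumed measurements collected from real user monitoring.
    measured = {"lcp_s": 1.9, "fid_ms": 140, "cls": 0.05}

    for vital, value in measured.items():
        verdict = "good" if value <= good_thresholds[vital] else "needs improvement"
        print(f"{vital}: {value} -> {verdict}")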

Methods of Collecting Metrics

Metrics can be collected using different approaches depending on the system's design and monitoring needs.

  • Push Method

    • Systems actively send metrics to a central monitoring server.
    • Ideal for applications behind firewalls or NAT.
    • Allows controlled, periodic data transmission.
  • Scrape Method

    • A monitoring server requests metrics from systems at regular intervals.
    • Commonly used in environments like Prometheus.
    • Ensures up-to-date data by pulling it on demand (see the sketch below).
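
As a sketch of the scrape model, the snippet below exposes a hard-coded counter in the Prometheus text format using only the Python standard library; the metric name and port are illustrative:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Serve metrics in the Prometheus text exposition format at /metrics.
            if self.path == "/metrics":
                body = "# TYPE http_requests_total counter\nhttp_requests_total 1027\n"
                self.send_response(200)
                self.send_header("Content-Type", "text/plain; version=0.0.4")
                self.end_headers()
                self.wfile.write(body.encode())
            else:
                self.send_response(404)
                self.end_headers()

    # A monitoring server such as Prometheus would poll
    # http://localhost:8000/metrics at its configured scrape interval.
    if __name__ == "__main__":
        HTTPServer(("", 8000), MetricsHandler).serve_forever()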

Which method to choose? Consider the following:

  • The type of systems or applications
  • Scalability requirements
  • Complexity of implementation

Service Level Concepts

Service Level Indicator (SLI)

A quantitative metric that measures specific aspects of system performance.

  • Represents key measurements like latency, error rates, or throughput.
  • Tracks system behavior to evaluate performance against objectives.

Examples

  • Latency
  • Availability
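
For instance, an availability SLI can be computed from request counts; the numbers below are assumed:

    # Assumed counts over a measurement window.
    total_requests = 100_000
    successful_requests = 99_940

    # Availability SLI: the fraction of requests served successfully.
    availability = successful_requests / total_requests
    print(f"availability SLI: {availability:.4%}")  # 99.9400%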

Service Level Objective (SLO)

A target or goal for system performance based on SLIs.

  • Defines acceptable thresholds for reliability or efficiency.
  • Guides operational priorities and service improvements.

Examples

  • Latency < 100 ms
  • Availability ≥ 99.9% uptime

It may be tempting to set these to aggressive values like 100% uptime, but that comes at a much higher cost. The goal is not to achieve perfection but to make customers happy with the right level of reliability.
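
One common way to reason about this trade-off is the error budget implied by an SLO; the numbers below are illustrative:

    # A 99.9% availability SLO leaves a 0.1% error budget.
    slo = 0.999
    total_requests = 1_000_000
    failed_requests = 650

    error_budget = (1 - slo) * total_requests  # 1,000 failed requests allowed
    budget_remaining = error_budget - failed_requests

    print(f"error budget: {error_budget:.0f} requests, remaining: {budget_remaining:.0f}")

When the remaining budget nears zero, teams typically slow down risky changes in favor of reliability work.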

Service Level Agreement (SLA)

A formal contract between a provider and user specifying service expectations.

  • Includes penalties or remedies if agreed targets are not met.
  • Ensures accountability and trust between parties.