OpenTelemetry (OTel)

Observability for Modern Web Systems

"OpenTelemetry is the open-source, vendor-neutral standard for generating, collecting, and exporting telemetry data — traces, metrics, and logs — so you can understand what is happening inside distributed systems."

CSE 135 — Full Overview

Sections 1–2: Observability & The Three Pillars

What observability means and the three signal types it relies on.

What is Observability?

From control theory: a system is observable if you can determine its internal state from its external outputs.

| Monitoring | Observability |
| --- | --- |
| Answers known questions ("Is the server up?") | Answers unknown questions ("Why is checkout slow for users in Brazil?") |
| Predefined alerts | Ad hoc investigation |
| Explanatory dataviz | Exploratory dataviz |
| Works well for monoliths | Essential for distributed systems |
| Focuses on symptoms | Focuses on root causes |
Key insight: Monitoring tells you that something is wrong. Observability helps you understand why.
LLM Worry: LLMs may help explain system behavior, but they favor known unknowns. For genuinely new issues (unknown unknowns), always validate LLM-generated insights with domain expertise.

The Three Pillars of Observability

| Pillar | What It Is | Strengths | Weaknesses |
| --- | --- | --- | --- |
| Logs | Discrete, timestamped event records | Rich context, human-readable | High volume, hard to correlate |
| Metrics | Numerical values aggregated over time | Low overhead, great for dashboards | Aggregation loses detail |
| Traces | Request journey through distributed services | Shows causality & latency breakdown | Complex to instrument & store |
The pillars work together. A metric alert tells you latency spiked. A trace shows the slow span was a database query. A log reveals the query hit a missing index. No single pillar gives you the full picture.
Not the Full Picture: Quantitative telemetry can miss the voice of the customer and other qualitative factors, and in real systems those are often the most important ones. A high click count on a page doesn't necessarily mean positive engagement!

Three Pillars: Examples

Logs

2026-02-24T10:15:32Z [INFO]  User 4821 logged in from 203.0.113.42
2026-02-24T10:15:33Z [ERROR] Payment gateway timeout after 5000ms - order #9912
2026-02-24T10:15:33Z [WARN]  Retrying payment for order #9912 (attempt 2/3)

Metrics

http_requests_total{method="GET", status="200", route="/api/users"} 148239
http_request_duration_seconds{quantile="0.99"} 0.847
node_memory_usage_bytes 734003200

Traces

Span 1: HTTP GET /api/profile   [0ms ─────────────────── 50ms]
Span 2:   gateway.route           [2ms ─────── 14ms]
Span 3:   user.getProfile             [15ms ──────────── 40ms]
Span 4:     db.query SELECT             [17ms ────── 35ms] ◀── bottleneck!

Sections 3–4: The Problem & What is OTel

Why the observability landscape was fragmented and how OTel fixes it.

The Problem OTel Solves

Before OTel:
  App Code ──▶ Datadog SDK      ──▶ Datadog only
  App Code ──▶ New Relic Agent  ──▶ New Relic only
  App Code ──▶ Jaeger Client    ──▶ Jaeger only
  Want to switch? Re-instrument everything.

After OTel:
  App Code ──▶ OTel SDK ──▶ OTel Collector ──▶ Datadog
                                            ──▶ Jaeger
                                            ──▶ Prometheus
                                            ──▶ Any OTLP backend
  Want to switch? Change the Collector config.
  • Vendor lock-in: Proprietary SDKs meant switching = re-instrumenting everything
  • Competing standards: OpenTracing + OpenCensus split the community
  • Inconsistent telemetry: Different teams, different libraries, different attribute names
  • Framework blind spots: Libraries rarely shipped with built-in observability

In 2019, OpenTracing and OpenCensus merged into OpenTelemetry under the CNCF.

Prof's lesson: Protocols & Standards > Platforms. The DevOps community learned this the hard way — you'll see this pattern over and over in your career.

What is OpenTelemetry?

OTel is not a backend or dashboard. It is a collection of specs, APIs, SDKs, and tools:

| Component | What It Is |
| --- | --- |
| Specification | Language-agnostic definitions of how signals are structured, propagated, exported |
| API | Interfaces that library authors instrument against — zero-dependency |
| SDK | Implementation that app owners install to collect and export data |
| Instrumentation Libraries | Pre-built integrations for Express, Flask, Spring, etc. |
| Collector | Standalone binary: receives, processes, exports telemetry |
| OTLP | Wire format (gRPC/HTTP protobuf) for transmitting to any backend |
OTel separates instrumentation from destination. Instrument once with OTel API/SDK. Where data goes is a configuration decision, not a code change.

Section 5: OTel Architecture

How telemetry flows from your application to your backends.

Architecture Overview

┌───────────────────────────────────────────────────────┐
│                   Your Application                    │
│                                                       │
│  ┌────────────┐   ┌────────────┐   ┌────────────┐     │
│  │ Your Code  │   │ Framework  │   │  Library   │     │
│  │ (manual    │   │ (Express,  │   │ (pg, redis,│     │
│  │  spans)    │   │ auto-inst) │   │  http)     │     │
│  └─────┬──────┘   └─────┬──────┘   └─────┬──────┘     │
│        └────────────────┼────────────────┘            │
│                         ▼                             │
│  ┌─────────────────────────────────┐                  │
│  │            OTel SDK             │                  │
│  │  Tracer | Meter | Log Provider  │                  │
│  │        Exporters (OTLP)         │                  │
│  └───────────────┬─────────────────┘                  │
└──────────────────┼────────────────────────────────────┘
                   │ OTLP (gRPC / HTTP)
                   ▼
        ┌────────────────────────┐
        │     OTel Collector     │
        │  Receivers             │ ← accepts OTLP, Jaeger, etc.
        │  Processors            │ ← batch, filter, enrich, sample
        │  Exporters             │ ← fan-out to backends
        └──────────┬─────────────┘
          ┌────────┼─────────┐
          ▼        ▼         ▼
       Jaeger  Prometheus  Grafana
Key principle: The Collector decouples your app from your backend. Your app speaks OTLP; the Collector speaks whatever your backends need.

Section 6: Traces in Depth

The most impactful signal for understanding distributed web applications.

Anatomy of a Span

A trace = tree of spans. Each span = one unit of work.

| Field | Description |
| --- | --- |
| traceId | Links this span to its trace (128-bit hex) |
| spanId | Unique identifier for this span |
| parentSpanId | The span that caused this span (tree structure) |
| name | Operation name: GET /api/users |
| startTime / endTime | When the work began and ended |
| status | OK, ERROR, or UNSET |
| attributes | Key-value pairs: http.method=GET, db.system=mysql |
| events | Timestamped annotations (e.g., "cache miss") |
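The fields above can be sketched as a plain object. This is a hypothetical in-memory span record for illustration, not the actual class an OTel SDK uses:

```javascript
// Hypothetical span record mirroring the fields in the table above.
const span = {
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736', // shared by every span in the trace
  spanId: 'a1b2c3d4e5f60718',                  // unique to this span
  parentSpanId: '00f067aa0ba902b7',            // the span that caused this one
  name: 'GET /api/users',
  startTime: 1000,                             // ms since some epoch
  endTime: 1050,
  status: 'OK',                                // OK, ERROR, or UNSET
  attributes: { 'http.method': 'GET', 'db.system': 'mysql' },
  events: [{ time: 1010, name: 'cache miss' }],
};

// Duration falls straight out of the two timestamps.
function durationMs(s) {
  return s.endTime - s.startTime;
}

console.log(durationMs(span)); // 50
```

Note that a span never stores its children; the tree is reconstructed later by matching each span's parentSpanId.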

Trace Waterfall Example

A user loads /dashboard — the trace shows exactly where time is spent:

Trace: 4bf92f3577b34da6                         Duration: 320ms

Service        Operation              Duration   Timeline
──────────────────────────────────────────────────────────────
web-frontend   GET /dashboard          320ms  [==============================]
api-gateway    route /dashboard        310ms   [============================]
user-svc       getUser(4821)            45ms   [====]
postgres       SELECT * FROM users      12ms    [=]
dash-svc       getDashboard(4821)      250ms        [======================]
redis          GET cache:dash:4821       2ms        [.]
analytics      getRecentMetrics()      180ms            [==================]
mysql          SELECT ... GROUP BY     170ms             [=================] ◀ slow!
render         buildHTML()              60ms                          [======]
Without tracing, you might spend hours guessing. One trace shows the analytics service's MySQL query consumes over half the total request time.
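A sketch of how a trace viewer might flag the bottleneck automatically: among leaf spans (spans no other span points to as a parent), pick the longest one, since a leaf's time is actual work rather than waiting on children. The span names and timings below are illustrative:

```javascript
// Find the slowest leaf span in a list of spans from one trace.
function slowestLeaf(spans) {
  const parents = new Set(spans.map((s) => s.parentSpanId));
  const leaves = spans.filter((s) => !parents.has(s.spanId));
  return leaves.reduce((worst, s) =>
    s.end - s.start > worst.end - worst.start ? s : worst);
}

const spans = [
  { spanId: '1', parentSpanId: null, name: 'GET /dashboard',      start: 0,  end: 320 },
  { spanId: '2', parentSpanId: '1',  name: 'getUser(4821)',       start: 5,  end: 50 },
  { spanId: '3', parentSpanId: '1',  name: 'getRecentMetrics()',  start: 60, end: 240 },
  { spanId: '4', parentSpanId: '3',  name: 'SELECT ... GROUP BY', start: 65, end: 235 },
];

console.log(slowestLeaf(spans).name); // "SELECT ... GROUP BY"
```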

Section 7: Metrics in Depth

Counters, histograms, gauges — and the cardinality trap.

Three Metric Instruments

| Instrument | What It Records | Example |
| --- | --- | --- |
| Counter | Monotonically increasing value | http.requests.total — always goes up |
| Histogram | Distribution of values (bucketized) | http.request.duration — p50, p95, p99 |
| Gauge | Point-in-time value, goes up and down | system.memory.usage — current memory |
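The semantics of the three instruments can be captured in a few lines. This is a toy sketch of the behavior, not the real OTel API (which lives in @opentelemetry/api):

```javascript
// Counter: only goes up; negative adds are rejected.
function makeCounter() {
  let total = 0;
  return { add: (n) => { if (n >= 0) total += n; }, value: () => total };
}

// Histogram: records a distribution in fixed buckets.
function makeHistogram(bounds) {            // e.g. [0.1, 0.5, 1] seconds
  const buckets = new Array(bounds.length + 1).fill(0);
  return {
    record(v) {
      const i = bounds.findIndex((b) => v <= b);
      buckets[i === -1 ? bounds.length : i]++;
    },
    buckets: () => buckets.slice(),
  };
}

// Gauge: last observed value, free to go up or down.
function makeGauge() {
  let current = 0;
  return { set: (v) => { current = v; }, value: () => current };
}

const reqs = makeCounter();
reqs.add(3);
const latency = makeHistogram([0.1, 0.5, 1]);
latency.record(0.08); latency.record(0.847);  // 0.847s lands in the (0.5, 1] bucket
const mem = makeGauge();
mem.set(734003200);
```

Bucketizing is why histograms stay cheap: the backend stores a handful of counts per series instead of every raw measurement.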

The Cardinality Trap

// Good: low-cardinality attributes
http_requests_total{method="GET", status="200"}   ← finite combinations

// Dangerous: high-cardinality attributes
http_requests_total{user_id="4821"}       ← millions of series!
http_requests_total{request_id="abc123"}  ← every request = new series
Cardinality explosion is the #1 operational problem with metrics. 10 methods × 10 status codes × 50 endpoints = 5,000 series. Add user_id with 1M users = 5 billion series. Use high-cardinality data in traces/logs, not metrics.

Having too much data can be just as problematic as too little — it also creates false certainty.
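The arithmetic behind the explosion is just a product over the distinct values of each attribute:

```javascript
// Series count = product of each attribute's distinct-value count.
const seriesCount = (...cardinalities) => cardinalities.reduce((a, b) => a * b, 1);

console.log(seriesCount(10, 10, 50));       // 5000 — methods × statuses × endpoints, fine
console.log(seriesCount(10, 10, 50, 1e6));  // 5000000000 — add user_id, explosion
```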

Exemplars: Bridging Metrics and Traces

OTel exemplars attach a trace ID to a metric data point. When you see a p99 latency spike on a dashboard, click through to the exact trace that caused it.

Dashboard: p99 latency spike at 10:15!
      │
      ▼  (exemplar links to trace)
Trace: 4bf92f3577b34da6
  └── db.query took 170ms  ◀── root cause
Exemplars are the bridge between aggregate metrics and individual request detail — one of the most powerful features of a unified observability system.

Sections 8–9: Logs & Context Propagation

Connecting logs to traces, and how tracing crosses service boundaries.

Logs in OTel

OTel does not replace existing loggers — it provides a Log Bridge API to connect them to the OTel pipeline.

Respect the Past: We should not throw away existing tools unless they're fundamentally broken. OTel bridges to existing loggers rather than replacing them. Provide the upgrade path — don't replace the ecosystem wholesale!

Structured vs. Unstructured

// Unstructured
console.log("User 4821 payment failed: timeout");

// Structured (OTel-enriched)
{
  "severity": "ERROR",
  "body": "Payment failed",
  "attributes": { "user.id": 4821, "timeout.ms": 5000 },
  "traceId": "4bf92f...",
  "spanId": "a1b2c3..."
}
Correlation is the goal. Logs become evidence when connected to traces through shared IDs. UUIDs and consistent clocks are critical to this concept.
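The correlation idea in miniature: a logger that stamps every record with the active trace context. Here the context is passed in explicitly for clarity; a real OTel log bridge reads it from the ambient active span instead:

```javascript
// Emit a structured log record carrying the active trace/span IDs so a
// backend can join this log line to the exact request that produced it.
function emitLog(severity, body, attributes, ctx) {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    severity,
    body,
    attributes,
    traceId: ctx.traceId,   // same ID every span in the request carries
    spanId: ctx.spanId,
  });
}

const line = emitLog('ERROR', 'Payment failed',
  { 'user.id': 4821, 'timeout.ms': 5000 },
  { traceId: '4bf92f3577b34da6a3ce929d0e0e4736', spanId: 'a1b2c3d4e5f60718' });
```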

Context Propagation

How does Service B know it's part of the same trace as Service A? The W3C traceparent header:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-a1b2c3d4e5f60718-01
     │       │  │                                │                 │
     │       │  └─ trace ID (32 hex)             └─ parent span ID └─ flags
     │       └─ version
     └─ header name (literal)

The OTel SDK automatically:

  1. Extracts trace ID and parent span ID from incoming request
  2. Creates a new child span linked to the parent
  3. Injects updated traceparent on outgoing calls
Service A                    Service B                    Service C
─────────                    ─────────                    ─────────
Span: "GET /api"             Span: "getUser"              Span: "query"
traceId: abc123              traceId: abc123              traceId: abc123
    │   traceparent: abc123      │   traceparent: abc123     │
    ├───────────────────────────▶├──────────────────────────▶│
Propagation only works if every service participates. If one service strips traceparent, the trace breaks. This parallels the challenges of sessioning with HTTP — HTTP's statelessness affects everything built on top of it.
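The extract/inject steps can be sketched directly against the W3C header format (the child span ID below is a made-up value; real SDKs generate random 64-bit IDs):

```javascript
// Parse an incoming W3C traceparent header: version-traceId-spanId-flags.
function extract(traceparent) {
  const [version, traceId, parentSpanId, flags] = traceparent.split('-');
  return { version, traceId, parentSpanId, flags };
}

// Build the outgoing header for a new child span: the trace ID is preserved,
// and this service's span ID becomes the next hop's parent.
function inject(ctx, childSpanId) {
  return `00-${ctx.traceId}-${childSpanId}-${ctx.flags}`;
}

const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-a1b2c3d4e5f60718-01';
const ctx = extract(incoming);
const outgoing = inject(ctx, 'b7ad6b7169203331'); // hypothetical new span ID
// outgoing: '00-4bf92f3577b34da6a3ce929d0e0e4736-b7ad6b7169203331-01'
```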

Section 10: Instrumentation: Auto vs. Manual

Two approaches to adding OTel to your application.

Auto-Instrumentation

Patches popular libraries at load time — zero code changes to your route handlers:

// tracing.js — loaded before your app
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
# Run your app with tracing preloaded
node --require ./tracing.js app.js

Automatically creates spans for: HTTP requests (Express, Fastify), DB queries (pg, mysql2, redis), gRPC, DNS, and more.

Manual Instrumentation

Auto-instrumentation does not know your business logic. Add manual spans for domain operations:

const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('checkout-service');

async function processCheckout(cart, user) {
  return tracer.startActiveSpan('checkout.process', async (span) => {
    try {
      span.setAttribute('user.id', user.id);
      span.setAttribute('cart.items', cart.items.length);
      const inventory = await checkInventory(cart);    // auto-instrumented
      const payment = await chargePayment(cart, user); // auto-instrumented
      span.setStatus({ code: SpanStatusCode.OK });
      return payment;
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end(); // always end the span, even on error
    }
  });
}
Best practice: Start with auto-instrumentation for immediate visibility. Then add manual spans for key business operations where you need domain context.
Tradeoff: Auto-instrumentation is easy, but it may not capture what you actually need, or it may produce so much data that the signal you care about is obscured.

Sections 11–12: The Collector & Sampling

The pipeline between your apps and backends, and controlling data volume.

The OTel Collector

A vendor-agnostic proxy — arguably the most important OTel component:

  • Decoupling: App sends OTLP to localhost; Collector handles routing
  • Processing: Batch, filter, sample, enrich before export
  • Fan-out: Same data to Jaeger (traces) + Prometheus (metrics) simultaneously
  • Protocol translation: Accept Jaeger format, export as OTLP (or vice versa)
Collector Pipeline:

┌────────────┐     ┌────────────────┐     ┌────────────┐
│ Receivers  │     │   Processors   │     │ Exporters  │
│            │     │                │     │            │
│ OTLP(gRPC) ├────▶│ memory_limiter ├────▶│ Jaeger     │
│ OTLP(HTTP) │     │ batch          │     │ Prometheus │
│ Jaeger     │     │ attributes     │     │ Datadog    │
│ Zipkin     │     │ tail_sampling  │     │ Logging    │
└────────────┘     └────────────────┘     └────────────┘

Each signal type (traces, metrics, logs) has its own pipeline.
Data Pipeline Familiarity: This store-and-forward approach mirrors the general data pipeline pattern we discussed. A common and proven pattern.
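A minimal Collector config wiring up the pipeline above. The otlp receiver and the memory_limiter and batch processors are standard Collector components; the backend endpoints and the Jaeger/Prometheus exporter choices are illustrative and depend on your deployment:

```yaml
receivers:
  otlp:
    protocols:
      grpc:                      # listens on :4317 by default
      http:                      # listens on :4318 by default

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  batch:                         # batch before export to cut overhead

exporters:
  otlp/jaeger:                   # modern Jaeger accepts OTLP natively
    endpoint: jaeger:4317        # illustrative hostname
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889       # scrape endpoint Prometheus pulls from

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
```

Switching backends means editing the exporters section and the pipeline wiring; application code never changes.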

Sampling Strategies

Sized Solutions: Be careful with sampling advice when starting out. When you are small, every single customer counts — you may not want to sample at all!
| Strategy | Where | How It Works | Tradeoff |
| --- | --- | --- | --- |
| Head-based | SDK (at trace start) | Decide before trace begins — e.g., keep 10% randomly | Simple, but may discard errors |
| Tail-based | Collector (after trace) | Collect all spans, then decide based on content | Keeps errors/outliers, costs memory |
| Rule-based | SDK or Collector | Always keep errors, latency > 2s, specific endpoints | Best of both, more complex config |
Common production pattern: 100% collection in SDK → tail-based sampling in Collector (keep errors + slow traces + 5% random). Full visibility into problems, controlled costs. But transmission costs are the trade-off!
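The production pattern above reduces to a small decision function. This is a sketch of the tail-based rules, not the configuration syntax of the Collector's actual tail_sampling processor; the 2-second and 5% thresholds come from the rules discussed above:

```javascript
// Tail-based sampling decision: runs after the trace is complete, so it
// can look at errors and total latency. rand is injectable for testing.
function keepTrace(trace, rand = Math.random) {
  if (trace.spans.some((s) => s.status === 'ERROR')) return true; // keep all errors
  if (trace.durationMs > 2000) return true;                       // keep slow traces
  return rand() < 0.05;                                           // 5% random sample
}

keepTrace({ durationMs: 120,  spans: [{ status: 'ERROR' }] });        // true: has an error
keepTrace({ durationMs: 3500, spans: [{ status: 'OK' }] });           // true: slow
keepTrace({ durationMs: 80,   spans: [{ status: 'OK' }] }, () => 0.9); // false: healthy, unlucky
```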

Sections 13–15: Conventions, Resources & Browser OTel

Standardized naming, service identity, and end-to-end browser traces.

Semantic Conventions

Standardized attribute names so telemetry works across tools and teams:

| Category | Attribute | Example |
| --- | --- | --- |
| HTTP | http.request.method | GET |
| HTTP | http.response.status_code | 200 |
| Database | db.system | mysql |
| Database | db.statement | SELECT * FROM users WHERE id=? |
| Service | service.name | user-service |
| Service | service.version | 2.1.0 |

Without conventions, one library calls it database_type, another dbSystem — cross-service queries become impossible.

Resources & Service Identity

service.name is the most important attribute in OTel. If you set nothing else, set this. Without it, your service shows up as unknown_service in dashboards.
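Setting it can be as simple as an environment variable; OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES are standard variables read by the OTel SDKs (the version and environment values here are illustrative):

```shell
# Identify the service for every signal it emits
export OTEL_SERVICE_NAME=user-service
export OTEL_RESOURCE_ATTRIBUTES="service.version=2.1.0,deployment.environment=prod"

# then launch as usual, e.g.: node --require ./tracing.js app.js
```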

OTel for the Web Browser

@opentelemetry/sdk-trace-web enables end-to-end traces from user click through backend:

  • Document load: DNS, TCP, TLS, TTFB — Navigation Timing API
  • Resource loading: Every CSS, JS, image fetch — Resource Timing API
  • Fetch/XHR: Outgoing API calls with automatic traceparent injection
  • User interactions: Clicks, form submissions, SPA route changes
  • Errors: Uncaught exceptions and unhandled promise rejections
Browser                               Server                     Database
───────                               ──────                     ────────
[click "Buy"]
Span: "interaction.click"
 ├─ Span: "HTTP POST /checkout"
 │    traceparent: 00-abc123 ───────▶ Span: "POST /checkout"
 │                                     ├─ Span: "validate.cart"
 │                                     ├─ Span: "charge.payment"
 │                                     ├─ Span: "db.insert" ─────▶ INSERT
 │  ◀─── 201 ─────────────────────────
 ├─ Span: "render.confirmation"
CORS matters. If your API doesn't include traceparent in Access-Control-Allow-Headers, the browser strips it and the trace breaks at the browser-server boundary.
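The server-side fix can be sketched as a helper that builds the preflight (OPTIONS) response headers; the exact header list and allowed origin are illustrative for your own API:

```javascript
// Headers a CORS preflight response must include so the browser is
// allowed to send traceparent on the actual cross-origin request.
function corsPreflightHeaders(allowedOrigin) {
  return {
    'Access-Control-Allow-Origin': allowedOrigin,
    'Access-Control-Allow-Methods': 'GET, POST, OPTIONS',
    // Without 'traceparent' listed here, the browser strips the header
    // and the trace breaks at the browser-server boundary.
    'Access-Control-Allow-Headers': 'Content-Type, traceparent, tracestate',
  };
}

const headers = corsPreflightHeaders('https://app.example.com');
```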

Sections 16–18: OTel vs. Analytics, Production & Ecosystem

How OTel relates to course projects, real-world deployment, and the broader community.

OTel vs. Behavioral Analytics

We build custom analytics in CSE 135 — where does OTel fit?

| | Behavioral Analytics (Google Analytics) | Technical Analytics (OTel) |
| --- | --- | --- |
| Purpose | What users do (pageviews, clicks, funnels) | How the system behaves (latency, errors) |
| Audience | Product managers, designers, marketers | Developers, SREs, DevOps |
| Questions | "Where do users drop off?" | "Why is this request slow?" |
| Data model | Events with user-centric attributes | Traces/spans with system-centric attributes |
| Collection | Client-side JS beacon | SDK auto-instrumentation + Collector |
| Standards | No universal standard (GA as de facto) | OTLP, W3C Trace Context |
Complementary, not competing. A mature app uses both: behavioral analytics to understand users and OTel to understand the system. You can even route analytics events through the OTel Collector.

Production Concerns & Privacy

Performance Overhead

  • CPU: Single-digit microseconds per span
  • Memory: Spans buffered before export; batch processor controls size
  • Network: Async batched export, but high-throughput = megabytes/sec of telemetry
  • Mitigation: Head-based sampling, tuned batch sizes and intervals

What Telemetry Can Leak

  • DB queries: db.statement may contain user data in WHERE clauses
  • URLs: Query params may contain tokens, emails, PII
  • Headers: Authorization, cookies, API keys
  • Errors: Stack traces may reveal architecture or user data
Telemetry is a data pipeline — it needs privacy review. Use the Collector's processor pipeline to redact, filter, and hash sensitive attributes before data leaves your infrastructure.
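What a redaction step might look like in code. This is a sketch of the idea, not the Collector's actual attributes or redaction processor, and the regex-based SQL scrubbing is deliberately naive:

```javascript
// Scrub sensitive span attributes before data leaves your infrastructure.
const DROP = new Set([
  'http.request.header.authorization',
  'http.request.header.cookie',
]);

function redactAttributes(attributes) {
  const out = {};
  for (const [key, value] of Object.entries(attributes)) {
    if (DROP.has(key)) continue;                       // drop secrets outright
    if (key === 'db.statement') {
      // Replace string and numeric literals with placeholders.
      out[key] = String(value).replace(/'[^']*'/g, '?').replace(/\b\d+\b/g, '?');
    } else if (key === 'url.full') {
      out[key] = String(value).split('?')[0];          // strip query params
    } else {
      out[key] = value;
    }
  }
  return out;
}

const clean = redactAttributes({
  'db.statement': "SELECT * FROM users WHERE email = 'a@b.com'",
  'url.full': 'https://shop.example/search?token=secret',
  'http.request.header.authorization': 'Bearer abc123',
});
// clean['db.statement'] → "SELECT * FROM users WHERE email = ?"
```

In production you would run this kind of transform in the Collector's processor pipeline so every service is covered by one policy.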

The OTel Ecosystem

OTel is governed by the CNCF with contributions from Google, Microsoft, Splunk, Datadog, and hundreds more.

  • Second most active CNCF project (after Kubernetes)
  • Supported natively by every major observability vendor
  • 100+ receivers, processors, and exporters in Collector Contrib
  • Kubernetes Operator for automated deployment
  • Emerging: profiling as a 4th signal, eBPF-based zero-code instrumentation
The question is no longer "should we use OTel?" but "how do we use OTel?" Students pursuing DevOps careers will find OTel knowledge very valuable.
Market Forces: The OTel architecture reflects SaaS-era thinking. Given seismic shifts with AI, the landscape may evolve. Standards and protocols endure longer than any particular platform.

Summary: Key Takeaways

19 sections of OpenTelemetry in one table.

OpenTelemetry at a Glance

| Concept | Key Takeaway |
| --- | --- |
| Observability | Understanding system internals from external outputs — goes beyond monitoring to answer "why" |
| Three pillars | Traces (request flow), Metrics (aggregated numbers), Logs (discrete events) — use all three |
| OpenTelemetry | Vendor-neutral standard for generating, collecting, and exporting telemetry |
| Architecture | App → OTel SDK → OTel Collector → Backend(s); instrument once, export anywhere |
| Context propagation | W3C traceparent carries trace/span IDs across service boundaries |
| Auto-instrumentation | Spans for HTTP, DB, RPC with zero code changes; manual spans for business logic |
| Collector | Vendor-agnostic pipeline (receivers → processors → exporters) |
| Sampling | Head-based (fast), tail-based (smart) — keep errors and outliers, sample the rest |
| Semantic conventions | Standardized attribute names enable cross-service querying |
| OTel vs. Behavioral | OTel = operational observability; Behavioral = product intelligence — complementary |
| Privacy | Telemetry can leak PII — use Collector processors to redact and filter |