OpenTelemetry (OTel)

Observability for Modern Web Systems

"OpenTelemetry is the open-source, vendor-neutral standard for generating, collecting, and exporting telemetry data — traces, metrics, and logs — so you can understand what is happening inside distributed systems."

CSE 135 — Full Overview

Sections 1–2: Observability & The Three Pillars

What observability means and the three signal types it relies on.

What is Observability?

From control theory: a system is observable if you can determine its internal state from its external outputs.

| Monitoring | Observability |
| --- | --- |
| Answers known questions ("Is the server up?") | Answers unknown questions ("Why is checkout slow for users in Brazil?") |
| Predefined alerts | Ad hoc investigation |
| Explanatory dataviz | Exploratory dataviz |
| Works well for monoliths | Essential for distributed systems |
| Focuses on symptoms | Focuses on root causes |
Key insight: Monitoring tells you that something is wrong. Observability helps you understand why.
LLM Worry: LLMs may help explain system behavior, but they favor known unknowns. For genuinely new issues (unknown unknowns), always validate LLM-generated insights with domain expertise.

The Three Pillars of Observability

| Pillar | What It Is | Strengths | Weaknesses |
| --- | --- | --- | --- |
| Logs | Discrete, timestamped event records | Rich context, human-readable | High volume, hard to correlate |
| Metrics | Numerical values aggregated over time | Low overhead, great for dashboards | Aggregation loses detail |
| Traces | Request journey through distributed services | Shows causality & latency breakdown | Complex to instrument & store |
The pillars work together. A metric alert tells you latency spiked. A trace shows the slow span was a database query. A log reveals the query hit a missing index. No single pillar gives you the full picture.
Not the Full Picture: Quantitative telemetry can miss the voice of the customer and other qualitative factors, and in real systems those are often the most important ones. A high click count on a page doesn't necessarily mean positive engagement!

Three Pillars: Examples

Logs

2026-02-24T10:15:32Z [INFO]  User 4821 logged in from 203.0.113.42
2026-02-24T10:15:33Z [ERROR] Payment gateway timeout after 5000ms - order #9912
2026-02-24T10:15:33Z [WARN]  Retrying payment for order #9912 (attempt 2/3)

Metrics

http_requests_total{method="GET", status="200", route="/api/users"} 148239
http_request_duration_seconds{quantile="0.99"} 0.847
node_memory_usage_bytes 734003200

Traces

Span 1: HTTP GET /api/profile   [0ms ─────────────────── 50ms]
Span 2:   gateway.route           [2ms ─────── 14ms]
Span 3:   user.getProfile             [15ms ──────────── 40ms]
Span 4:     db.query SELECT             [17ms ────── 35ms] ◀── bottleneck!

Sections 3–4: The Problem & What is OTel

Why the observability landscape was fragmented and how OTel fixes it.

The Problem OTel Solves

Before OTel:
  App Code ──▶ Datadog SDK      ──▶ Datadog only
  App Code ──▶ New Relic Agent  ──▶ New Relic only
  App Code ──▶ Jaeger Client    ──▶ Jaeger only
  Want to switch? Re-instrument everything.

After OTel:
  App Code ──▶ OTel SDK ──▶ OTel Collector ──▶ Datadog
                                            ──▶ Jaeger
                                            ──▶ Prometheus
                                            ──▶ Any OTLP backend
  Want to switch? Change the Collector config.
  • Vendor lock-in: Proprietary SDKs meant switching = re-instrumenting everything
  • Competing standards: OpenTracing + OpenCensus split the community
  • Inconsistent telemetry: Different teams, different libraries, different attribute names
  • Framework blind spots: Libraries rarely shipped with built-in observability

In 2019, OpenTracing and OpenCensus merged into OpenTelemetry under the CNCF.

Prof's lesson: Protocols & Standards > Platforms. The DevOps community learned this the hard way — you'll see this pattern over and over in your career.

What is OpenTelemetry?

OTel is not a backend or dashboard. It is a collection of specs, APIs, SDKs, and tools:

| Component | What It Is |
| --- | --- |
| Specification | Language-agnostic definitions of how signals are structured, propagated, exported |
| API | Interfaces that library authors instrument against — zero-dependency |
| SDK | Implementation that app owners install to collect and export data |
| Instrumentation Libraries | Pre-built integrations for Express, Flask, Spring, etc. |
| Collector | Standalone binary: receives, processes, exports telemetry |
| OTLP | Wire format (gRPC/HTTP protobuf) for transmitting to any backend |
OTel separates instrumentation from destination. Instrument once with OTel API/SDK. Where data goes is a configuration decision, not a code change.

Section 5: OTel Architecture

How telemetry flows from your application to your backends.

Architecture Overview

┌───────────────────────────────────────────────────────┐
│                   Your Application                    │
│                                                       │
│  ┌────────────┐   ┌────────────┐   ┌────────────┐     │
│  │ Your Code  │   │ Framework  │   │  Library   │     │
│  │ (manual    │   │ (Express,  │   │ (pg, redis,│     │
│  │  spans)    │   │ auto-inst) │   │  http)     │     │
│  └─────┬──────┘   └─────┬──────┘   └─────┬──────┘     │
│        └────────────────┼────────────────┘            │
│                         ▼                             │
│  ┌─────────────────────────────────┐                  │
│  │            OTel SDK             │                  │
│  │  Tracer | Meter | Log Provider  │                  │
│  │        Exporters (OTLP)         │                  │
│  └───────────────┬─────────────────┘                  │
└──────────────────┼────────────────────────────────────┘
                   │ OTLP (gRPC / HTTP)
                   ▼
        ┌────────────────────────┐
        │     OTel Collector     │
        │  Receivers             │ ← accepts OTLP, Jaeger, etc.
        │  Processors            │ ← batch, filter, enrich, sample
        │  Exporters             │ ← fan-out to backends
        └──────────┬─────────────┘
          ┌────────┼─────────┐
          ▼        ▼         ▼
       Jaeger  Prometheus  Grafana
Key principle: The Collector decouples your app from your backend. Your app speaks OTLP; the Collector speaks whatever your backends need.

Section 6: Traces in Depth

The most impactful signal for understanding distributed web applications.

Anatomy of a Span

A trace = tree of spans. Each span = one unit of work.

| Field | Description |
| --- | --- |
| traceId | Links this span to its trace (128-bit hex) |
| spanId | Unique identifier for this span |
| parentSpanId | The span that caused this span (tree structure) |
| name | Operation name: GET /api/users |
| startTime / endTime | When the work began and ended |
| status | OK, ERROR, or UNSET |
| attributes | Key-value pairs: http.method=GET, db.system=mysql |
| events | Timestamped annotations (e.g., "cache miss") |
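The fields above can be sketched as a plain object. This is a hypothetical in-memory span record for illustration, not the actual class an OTel SDK uses:

```javascript
// Hypothetical span record mirroring the fields in the table above.
const span = {
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736', // shared by every span in the trace
  spanId: 'a1b2c3d4e5f60718',                  // unique to this span
  parentSpanId: '00f067aa0ba902b7',            // the span that caused this one
  name: 'GET /api/users',
  startTime: 1000,                             // ms since some epoch
  endTime: 1050,
  status: 'OK',                                // OK, ERROR, or UNSET
  attributes: { 'http.method': 'GET', 'db.system': 'mysql' },
  events: [{ time: 1010, name: 'cache miss' }],
};

// Duration falls straight out of the two timestamps.
function durationMs(s) {
  return s.endTime - s.startTime;
}

console.log(durationMs(span)); // 50
```

Note that a span never stores its children; the tree is reconstructed later by matching each span's parentSpanId.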

Trace Waterfall Example

A user loads /dashboard — the trace shows exactly where time is spent:

Trace: 4bf92f3577b34da6                         Duration: 320ms

Service        Operation              Duration   Timeline
──────────────────────────────────────────────────────────────
web-frontend   GET /dashboard          320ms  [==============================]
api-gateway    route /dashboard        310ms   [============================]
user-svc       getUser(4821)            45ms   [====]
postgres       SELECT * FROM users      12ms    [=]
dash-svc       getDashboard(4821)      250ms        [======================]
redis          GET cache:dash:4821       2ms        [.]
analytics      getRecentMetrics()      180ms            [==================]
mysql          SELECT ... GROUP BY     170ms             [=================] ◀ slow!
render         buildHTML()              60ms                          [======]
Without tracing, you might spend hours guessing. One trace shows the analytics service's MySQL query consumes over half the total request time.
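A sketch of how a trace viewer might flag the bottleneck automatically: among leaf spans (spans no other span points to as a parent), pick the longest one, since a leaf's time is actual work rather than waiting on children. The span names and timings below are illustrative:

```javascript
// Find the slowest leaf span in a list of spans from one trace.
function slowestLeaf(spans) {
  const parents = new Set(spans.map((s) => s.parentSpanId));
  const leaves = spans.filter((s) => !parents.has(s.spanId));
  return leaves.reduce((worst, s) =>
    s.end - s.start > worst.end - worst.start ? s : worst);
}

const spans = [
  { spanId: '1', parentSpanId: null, name: 'GET /dashboard',      start: 0,  end: 320 },
  { spanId: '2', parentSpanId: '1',  name: 'getUser(4821)',       start: 5,  end: 50 },
  { spanId: '3', parentSpanId: '1',  name: 'getRecentMetrics()',  start: 60, end: 240 },
  { spanId: '4', parentSpanId: '3',  name: 'SELECT ... GROUP BY', start: 65, end: 235 },
];

console.log(slowestLeaf(spans).name); // "SELECT ... GROUP BY"
```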

Section 7: Metrics in Depth

Counters, histograms, gauges — and the cardinality trap.

Three Metric Instruments

| Instrument | What It Records | Example |
| --- | --- | --- |
| Counter | Monotonically increasing value | http.requests.total — always goes up |
| Histogram | Distribution of values (bucketized) | http.request.duration — p50, p95, p99 |
| Gauge | Point-in-time value, goes up and down | system.memory.usage — current memory |
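The semantics of the three instruments can be captured in a few lines. This is a toy sketch of the behavior, not the real OTel API (which lives in @opentelemetry/api):

```javascript
// Counter: only goes up; negative adds are rejected.
function makeCounter() {
  let total = 0;
  return { add: (n) => { if (n >= 0) total += n; }, value: () => total };
}

// Histogram: records a distribution in fixed buckets.
function makeHistogram(bounds) {            // e.g. [0.1, 0.5, 1] seconds
  const buckets = new Array(bounds.length + 1).fill(0);
  return {
    record(v) {
      const i = bounds.findIndex((b) => v <= b);
      buckets[i === -1 ? bounds.length : i]++;
    },
    buckets: () => buckets.slice(),
  };
}

// Gauge: last observed value, free to go up or down.
function makeGauge() {
  let current = 0;
  return { set: (v) => { current = v; }, value: () => current };
}

const reqs = makeCounter();
reqs.add(3);
const latency = makeHistogram([0.1, 0.5, 1]);
latency.record(0.08); latency.record(0.847);  // 0.847s lands in the (0.5, 1] bucket
const mem = makeGauge();
mem.set(734003200);
```

Bucketizing is why histograms stay cheap: the backend stores a handful of counts per series instead of every raw measurement.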

The Cardinality Trap

// Good: low-cardinality attributes
http_requests_total{method="GET", status="200"}   ← finite combinations

// Dangerous: high-cardinality attributes
http_requests_total{user_id="4821"}       ← millions of series!
http_requests_total{request_id="abc123"}  ← every request = new series
Cardinality explosion is the #1 operational problem with metrics. 10 methods × 10 status codes × 50 endpoints = 5,000 series. Add user_id with 1M users = 5 billion series. Use high-cardinality data in traces/logs, not metrics.

Having too much data can be just as problematic as too little — it also creates false certainty.
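The arithmetic behind the explosion is just a product over the distinct values of each attribute:

```javascript
// Series count = product of each attribute's distinct-value count.
const seriesCount = (...cardinalities) => cardinalities.reduce((a, b) => a * b, 1);

console.log(seriesCount(10, 10, 50));       // 5000 — methods × statuses × endpoints, fine
console.log(seriesCount(10, 10, 50, 1e6));  // 5000000000 — add user_id, explosion
```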

Exemplars: Bridging Metrics and Traces

OTel exemplars attach a trace ID to a metric data point. When you see a p99 latency spike on a dashboard, click through to the exact trace that caused it.

Dashboard: p99 latency spike at 10:15!
      │
      ▼  (exemplar links to trace)
Trace: 4bf92f3577b34da6
  └── db.query took 170ms  ◀── root cause
Exemplars are the bridge between aggregate metrics and individual request detail — one of the most powerful features of a unified observability system.

Sections 8–9: Logs & Context Propagation

Connecting logs to traces, and how tracing crosses service boundaries.

Logs in OTel

OTel does not replace existing loggers — it provides a Log Bridge API to connect them to the OTel pipeline.

Respect the Past: We should not throw away existing tools unless they're fundamentally broken. OTel bridges to existing loggers rather than replacing them. Provide the upgrade path — don't replace the ecosystem wholesale!

Structured vs. Unstructured

// Unstructured
console.log("User 4821 payment failed: timeout");

// Structured (OTel-enriched)
{
  "severity": "ERROR",
  "body": "Payment failed",
  "attributes": { "user.id": 4821, "timeout.ms": 5000 },
  "traceId": "4bf92f...",
  "spanId": "a1b2c3..."
}
Correlation is the goal. Logs become evidence when connected to traces through shared IDs. UUIDs and consistent clocks are critical to this concept.
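The correlation idea in miniature: a logger that stamps every record with the active trace context. Here the context is passed in explicitly for clarity; a real OTel log bridge reads it from the ambient active span instead:

```javascript
// Emit a structured log record carrying the active trace/span IDs so a
// backend can join this log line to the exact request that produced it.
function emitLog(severity, body, attributes, ctx) {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    severity,
    body,
    attributes,
    traceId: ctx.traceId,   // same ID every span in the request carries
    spanId: ctx.spanId,
  });
}

const line = emitLog('ERROR', 'Payment failed',
  { 'user.id': 4821, 'timeout.ms': 5000 },
  { traceId: '4bf92f3577b34da6a3ce929d0e0e4736', spanId: 'a1b2c3d4e5f60718' });
```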

Context Propagation

How does Service B know it's part of the same trace as Service A? The W3C traceparent header:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-a1b2c3d4e5f60718-01
     │       │  │                                │                 │
     │       │  └─ trace ID (32 hex)             └─ parent span ID └─ flags
     │       └─ version
     └─ header name (literal)

The OTel SDK automatically:

  1. Extracts trace ID and parent span ID from incoming request
  2. Creates a new child span linked to the parent
  3. Injects updated traceparent on outgoing calls
Service A                    Service B                    Service C
─────────                    ─────────                    ─────────
Span: "GET /api"             Span: "getUser"              Span: "query"
traceId: abc123              traceId: abc123              traceId: abc123
    │   traceparent: abc123      │   traceparent: abc123     │
    ├───────────────────────────▶├──────────────────────────▶│
Propagation only works if every service participates. If one service strips traceparent, the trace breaks. This parallels the challenges of sessioning with HTTP — HTTP's statelessness affects everything built on top of it.
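The extract/inject steps can be sketched directly against the W3C header format (the child span ID below is a made-up value; real SDKs generate random 64-bit IDs):

```javascript
// Parse an incoming W3C traceparent header: version-traceId-spanId-flags.
function extract(traceparent) {
  const [version, traceId, parentSpanId, flags] = traceparent.split('-');
  return { version, traceId, parentSpanId, flags };
}

// Build the outgoing header for a new child span: the trace ID is preserved,
// and this service's span ID becomes the next hop's parent.
function inject(ctx, childSpanId) {
  return `00-${ctx.traceId}-${childSpanId}-${ctx.flags}`;
}

const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-a1b2c3d4e5f60718-01';
const ctx = extract(incoming);
const outgoing = inject(ctx, 'b7ad6b7169203331'); // hypothetical new span ID
// outgoing: '00-4bf92f3577b34da6a3ce929d0e0e4736-b7ad6b7169203331-01'
```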

Section 10: Instrumentation: Auto vs. Manual

Two approaches to adding OTel to your application.

Auto-Instrumentation

Patches popular libraries at load time — zero code changes to your route handlers:

// tracing.js — loaded before your app
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
# Run your app with tracing preloaded
node --require ./tracing.js app.js

Automatically creates spans for: HTTP requests (Express, Fastify), DB queries (pg, mysql2, redis), gRPC, DNS, and more.

Manual Instrumentation

Auto-instrumentation does not know your business logic. Add manual spans for domain operations:

const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('checkout-service');

async function processCheckout(cart, user) {
  return tracer.startActiveSpan('checkout.process', async (span) => {
    try {
      span.setAttribute('user.id', user.id);
      span.setAttribute('cart.items', cart.items.length);
      const inventory = await checkInventory(cart);    // auto-instrumented
      const payment = await chargePayment(cart, user); // auto-instrumented
      span.setStatus({ code: SpanStatusCode.OK });
      return payment;
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end(); // always end the span, even on error
    }
  });
}
Best practice: Start with auto-instrumentation for immediate visibility. Then add manual spans for key business operations where you need domain context.
Tradeoff: Auto-instrumentation is easy, but it may not capture what you actually need, or it may produce so much data that the signal you care about is obscured.

Sections 11–12: The Collector & Sampling

The pipeline between your apps and backends, and controlling data volume.

The OTel Collector

A vendor-agnostic proxy — arguably the most important OTel component:

  • Decoupling: App sends OTLP to localhost; Collector handles routing
  • Processing: Batch, filter, sample, enrich before export
  • Fan-out: Same data to Jaeger (traces) + Prometheus (metrics) simultaneously
  • Protocol translation: Accept Jaeger format, export as OTLP (or vice versa)
Collector Pipeline:

┌────────────┐     ┌────────────────┐     ┌────────────┐
│ Receivers  │     │   Processors   │     │ Exporters  │
│            │     │                │     │            │
│ OTLP(gRPC) ├────▶│ memory_limiter ├────▶│ Jaeger     │
│ OTLP(HTTP) │     │ batch          │     │ Prometheus │
│ Jaeger     │     │ attributes     │     │ Datadog    │
│ Zipkin     │     │ tail_sampling  │     │ Logging    │
└────────────┘     └────────────────┘     └────────────┘

Each signal type (traces, metrics, logs) has its own pipeline.
Data Pipeline Familiarity: This store-and-forward approach mirrors the general data pipeline pattern we discussed. A common and proven pattern.
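A minimal Collector config wiring up the pipeline above. The otlp receiver and the memory_limiter and batch processors are standard Collector components; the backend endpoints and the Jaeger/Prometheus exporter choices are illustrative and depend on your deployment:

```yaml
receivers:
  otlp:
    protocols:
      grpc:                      # listens on :4317 by default
      http:                      # listens on :4318 by default

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  batch:                         # batch before export to cut overhead

exporters:
  otlp/jaeger:                   # modern Jaeger accepts OTLP natively
    endpoint: jaeger:4317        # illustrative hostname
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889       # scrape endpoint Prometheus pulls from

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
```

Switching backends means editing the exporters section and the pipeline wiring; application code never changes.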

Sampling Strategies

Sized Solutions: Be careful with sampling advice when starting out. When you are small, every single customer counts — you may not want to sample at all!
| Strategy | Where | How It Works | Tradeoff |
| --- | --- | --- | --- |
| Head-based | SDK (at trace start) | Decide before trace begins — e.g., keep 10% randomly | Simple, but may discard errors |
| Tail-based | Collector (after trace) | Collect all spans, then decide based on content | Keeps errors/outliers, costs memory |
| Rule-based | SDK or Collector | Always keep errors, latency > 2s, specific endpoints | Best of both, more complex config |
Common production pattern: 100% collection in SDK → tail-based sampling in Collector (keep errors + slow traces + 5% random). Full visibility into problems, controlled costs. But transmission costs are the trade-off!
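The production pattern above reduces to a small decision function. This is a sketch of the tail-based rules, not the configuration syntax of the Collector's actual tail_sampling processor; the 2-second and 5% thresholds come from the rules discussed above:

```javascript
// Tail-based sampling decision: runs after the trace is complete, so it
// can look at errors and total latency. rand is injectable for testing.
function keepTrace(trace, rand = Math.random) {
  if (trace.spans.some((s) => s.status === 'ERROR')) return true; // keep all errors
  if (trace.durationMs > 2000) return true;                       // keep slow traces
  return rand() < 0.05;                                           // 5% random sample
}

keepTrace({ durationMs: 120,  spans: [{ status: 'ERROR' }] });        // true: has an error
keepTrace({ durationMs: 3500, spans: [{ status: 'OK' }] });           // true: slow
keepTrace({ durationMs: 80,   spans: [{ status: 'OK' }] }, () => 0.9); // false: healthy, unlucky
```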

Sections 13–15: Conventions, Resources & Browser OTel

Standardized naming, service identity, and end-to-end browser traces.

Semantic Conventions

Standardized attribute names so telemetry works across tools and teams:

| Category | Attribute | Example |
| --- | --- | --- |
| HTTP | http.request.method | GET |
| HTTP | http.response.status_code | 200 |
| Database | db.system | mysql |
| Database | db.statement | SELECT * FROM users WHERE id=? |
| Service | service.name | user-service |
| Service | service.version | 2.1.0 |

Without conventions, one library calls it database_type, another dbSystem — cross-service queries become impossible.

Resources & Service Identity

service.name is the most important attribute in OTel. If you set nothing else, set this. Without it, your service shows up as unknown_service in dashboards.
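Setting it can be as simple as an environment variable; OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES are standard variables read by the OTel SDKs (the version and environment values here are illustrative):

```shell
# Identify the service for every signal it emits
export OTEL_SERVICE_NAME=user-service
export OTEL_RESOURCE_ATTRIBUTES="service.version=2.1.0,deployment.environment=prod"

# then launch as usual, e.g.: node --require ./tracing.js app.js
```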

OTel for the Web Browser

@opentelemetry/sdk-trace-web enables end-to-end traces from user click through backend:

  • Document load: DNS, TCP, TLS, TTFB — Navigation Timing API
  • Resource loading: Every CSS, JS, image fetch — Resource Timing API
  • Fetch/XHR: Outgoing API calls with automatic traceparent injection
  • User interactions: Clicks, form submissions, SPA route changes
  • Errors: Uncaught exceptions and unhandled promise rejections
Browser                               Server                     Database
───────                               ──────                     ────────
[click "Buy"]
Span: "interaction.click"
 ├─ Span: "HTTP POST /checkout"
 │    traceparent: 00-abc123 ───────▶ Span: "POST /checkout"
 │                                     ├─ Span: "validate.cart"
 │                                     ├─ Span: "charge.payment"
 │                                     ├─ Span: "db.insert" ─────▶ INSERT
 │  ◀─── 201 ─────────────────────────
 ├─ Span: "render.confirmation"
CORS matters. If your API doesn't include traceparent in Access-Control-Allow-Headers, the browser strips it and the trace breaks at the browser-server boundary.
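The server-side fix can be sketched as a helper that builds the preflight (OPTIONS) response headers; the exact header list and allowed origin are illustrative for your own API:

```javascript
// Headers a CORS preflight response must include so the browser is
// allowed to send traceparent on the actual cross-origin request.
function corsPreflightHeaders(allowedOrigin) {
  return {
    'Access-Control-Allow-Origin': allowedOrigin,
    'Access-Control-Allow-Methods': 'GET, POST, OPTIONS',
    // Without 'traceparent' listed here, the browser strips the header
    // and the trace breaks at the browser-server boundary.
    'Access-Control-Allow-Headers': 'Content-Type, traceparent, tracestate',
  };
}

const headers = corsPreflightHeaders('https://app.example.com');
```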

Sections 16–18: OTel vs. Analytics, Production & Ecosystem

How OTel relates to course projects, real-world deployment, and the broader community.

OTel vs. Behavioral Analytics

We build custom analytics in CSE 135 — where does OTel fit?

| | Behavioral Analytics (Google Analytics) | Technical Analytics (OTel) |
| --- | --- | --- |
| Purpose | What users do (pageviews, clicks, funnels) | How the system behaves (latency, errors) |
| Audience | Product managers, designers, marketers | Developers, SREs, DevOps |
| Questions | "Where do users drop off?" | "Why is this request slow?" |
| Data model | Events with user-centric attributes | Traces/spans with system-centric attributes |
| Collection | Client-side JS beacon | SDK auto-instrumentation + Collector |
| Standards | No universal standard (GA as de facto) | OTLP, W3C Trace Context |
Complementary, not competing. A mature app uses both: behavioral analytics to understand users and OTel to understand the system. You can even route analytics events through the OTel Collector.

Production Concerns & Privacy

Performance Overhead

  • CPU: Single-digit microseconds per span
  • Memory: Spans buffered before export; batch processor controls size
  • Network: Async batched export, but high-throughput = megabytes/sec of telemetry
  • Mitigation: Head-based sampling, tuned batch sizes and intervals

What Telemetry Can Leak

  • DB queries: db.statement may contain user data in WHERE clauses
  • URLs: Query params may contain tokens, emails, PII
  • Headers: Authorization, cookies, API keys
  • Errors: Stack traces may reveal architecture or user data
Telemetry is a data pipeline — it needs privacy review. Use the Collector's processor pipeline to redact, filter, and hash sensitive attributes before data leaves your infrastructure.
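What a redaction step might look like in code. This is a sketch of the idea, not the Collector's actual attributes or redaction processor, and the regex-based SQL scrubbing is deliberately naive:

```javascript
// Scrub sensitive span attributes before data leaves your infrastructure.
const DROP = new Set([
  'http.request.header.authorization',
  'http.request.header.cookie',
]);

function redactAttributes(attributes) {
  const out = {};
  for (const [key, value] of Object.entries(attributes)) {
    if (DROP.has(key)) continue;                       // drop secrets outright
    if (key === 'db.statement') {
      // Replace string and numeric literals with placeholders.
      out[key] = String(value).replace(/'[^']*'/g, '?').replace(/\b\d+\b/g, '?');
    } else if (key === 'url.full') {
      out[key] = String(value).split('?')[0];          // strip query params
    } else {
      out[key] = value;
    }
  }
  return out;
}

const clean = redactAttributes({
  'db.statement': "SELECT * FROM users WHERE email = 'a@b.com'",
  'url.full': 'https://shop.example/search?token=secret',
  'http.request.header.authorization': 'Bearer abc123',
});
// clean['db.statement'] → "SELECT * FROM users WHERE email = ?"
```

In production you would run this kind of transform in the Collector's processor pipeline so every service is covered by one policy.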

The OTel Ecosystem

OTel is governed by the CNCF with contributions from Google, Microsoft, Splunk, Datadog, and hundreds more.

  • Second most active CNCF project (after Kubernetes)
  • Supported natively by every major observability vendor
  • 100+ receivers, processors, and exporters in Collector Contrib
  • Kubernetes Operator for automated deployment
  • Emerging: profiling as a 4th signal, eBPF-based zero-code instrumentation
The question is no longer "should we use OTel?" but "how do we use OTel?" Students pursuing DevOps careers will find OTel knowledge very valuable.
Market Forces: The OTel architecture reflects SaaS-era thinking. Given seismic shifts with AI, the landscape may evolve. Standards and protocols endure longer than any particular platform.

Summary: Key Takeaways

19 sections of OpenTelemetry in one table.

OpenTelemetry at a Glance

| Concept | Key Takeaway |
| --- | --- |
| Observability | Understanding system internals from external outputs — goes beyond monitoring to answer "why" |
| Three pillars | Traces (request flow), Metrics (aggregated numbers), Logs (discrete events) — use all three |
| OpenTelemetry | Vendor-neutral standard for generating, collecting, and exporting telemetry |
| Architecture | App → OTel SDK → OTel Collector → Backend(s); instrument once, export anywhere |
| Context propagation | W3C traceparent carries trace/span IDs across service boundaries |
| Auto-instrumentation | Spans for HTTP, DB, RPC with zero code changes; manual spans for business logic |
| Collector | Vendor-agnostic pipeline (receivers → processors → exporters) |
| Sampling | Head-based (fast), tail-based (smart) — keep errors and outliers, sample the rest |
| Semantic conventions | Standardized attribute names enable cross-service querying |
| OTel vs. Behavioral | OTel = operational observability; Behavioral = product intelligence — complementary |
| Privacy | Telemetry can leak PII — use Collector processors to redact and filter |