Observability for Modern Web Systems
"OpenTelemetry is the open-source, vendor-neutral standard for generating, collecting, and exporting telemetry data — traces, metrics, and logs — so you can understand what is happening inside distributed systems."
CSE 135 — Full Overview
What observability means and the three signal types it relies on.
From control theory: a system is observable if you can determine its internal state from its external outputs.
| Monitoring | Observability |
|---|---|
| Answers known questions ("Is the server up?") | Answers unknown questions ("Why is checkout slow for users in Brazil?") |
| Predefined alerts | Ad hoc investigation |
| Explanatory data visualization | Exploratory data visualization |
| Works well for monoliths | Essential for distributed systems |
| Focuses on symptoms | Focuses on root causes |
| Pillar | What It Is | Strengths | Weaknesses |
|---|---|---|---|
| Logs | Discrete, timestamped event records | Rich context, human-readable | High volume, hard to correlate |
| Metrics | Numerical values aggregated over time | Low overhead, great for dashboards | Aggregation loses detail |
| Traces | Request journey through distributed services | Shows causality & latency breakdown | Complex to instrument & store |
Why the observability landscape was fragmented and how OTel fixes it.
In 2019, OpenTracing and OpenCensus merged into OpenTelemetry under the CNCF.
OTel is not a backend or dashboard. It is a collection of specs, APIs, SDKs, and tools:
| Component | What It Is |
|---|---|
| Specification | Language-agnostic definitions of how signals are structured, propagated, exported |
| API | Interfaces that library authors instrument against — zero-dependency |
| SDK | Implementation that app owners install to collect and export data |
| Instrumentation Libraries | Pre-built integrations for Express, Flask, Spring, etc. |
| Collector | Standalone binary: receives, processes, exports telemetry |
| OTLP | Wire format (gRPC/HTTP protobuf) for transmitting to any backend |
How telemetry flows from your application to your backends.
The most impactful signal for understanding distributed web applications.
A trace = tree of spans. Each span = one unit of work.
| Field | Description |
|---|---|
| traceId | Links this span to its trace (128-bit hex) |
| spanId | Unique identifier for this span |
| parentSpanId | The span that caused this span (tree structure) |
| name | Operation name: GET /api/users |
| startTime / endTime | When the work began and ended |
| status | OK, ERROR, or UNSET |
| attributes | Key-value pairs: http.method=GET, db.system=mysql |
| events | Timestamped annotations (e.g., "cache miss") |
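Assembled, a span is just structured data. A plain-object sketch shaped after the fields above (all values are made up for illustration):

```javascript
// Illustrative span — not the OTel SDK's internal representation.
const span = {
  traceId: "4bf92f3577b34da6a3ce929d0e0e4736", // 128-bit hex, shared by every span in the trace
  spanId: "00f067aa0ba902b7",                  // unique to this span
  parentSpanId: "a1b2c3d4e5f60718",            // the span that caused this one
  name: "GET /api/users",
  startTime: 1710000000000,                    // epoch milliseconds
  endTime:   1710000000042,
  status: "OK",
  attributes: { "http.method": "GET", "db.system": "mysql" },
  events: [{ time: 1710000000010, name: "cache miss" }],
};

// Duration is simply end minus start.
const durationMs = span.endTime - span.startTime;
console.log(durationMs); // 42
```

Backends reconstruct the trace tree by grouping spans on traceId and linking each span to its parentSpanId.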
A user loads /dashboard — the trace shows exactly where time is spent: the analytics service's MySQL query consumes over half the total request time.
Counters, histograms, gauges — and the cardinality trap.
| Instrument | What It Records | Example |
|---|---|---|
| Counter | Monotonically increasing value | http.requests.total — always goes up |
| Histogram | Distribution of values (bucketized) | http.request.duration — p50, p95, p99 |
| Gauge | Point-in-time value, goes up and down | system.memory.usage — current memory |
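The semantics of the three instruments can be sketched with plain objects (a toy model of the behavior, not the OTel metrics API):

```javascript
// Counter: monotonic — only goes up.
function makeCounter() {
  let total = 0;
  return { add: (n) => { if (n >= 0) total += n; }, value: () => total };
}

// Histogram: counts values into buckets by upper bound.
function makeHistogram(bounds) {
  const counts = new Array(bounds.length + 1).fill(0);
  return {
    record(v) {
      const i = bounds.findIndex((b) => v <= b);
      counts[i === -1 ? bounds.length : i]++; // last slot is the overflow bucket
    },
    counts: () => counts,
  };
}

// Gauge: a point-in-time value that goes up and down.
function makeGauge() {
  let current = 0;
  return { set: (v) => { current = v; }, value: () => current };
}

const requests = makeCounter();
requests.add(1); requests.add(1);

const duration = makeHistogram([100, 500, 1000]); // ms bucket bounds
duration.record(42); duration.record(730);

const memory = makeGauge();
memory.set(512); memory.set(480); // only the latest value matters

console.log(requests.value(), duration.counts(), memory.value());
```

Percentiles like p95 are computed from the histogram's bucket counts, which is why the bucket bounds matter.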
Each unique combination of label values creates a new time series: a user_id label with 1M distinct users, multiplied by a handful of other labels, can explode into 5 billion series. Keep high-cardinality data in traces and logs, not metrics.
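The arithmetic behind the explosion is just multiplication over label cardinalities; a sketch with illustrative counts:

```javascript
// Series count = product of each label's distinct values.
// The label names and counts below are illustrative, not from a real system.
const labelCardinality = {
  endpoint: 50,        // distinct routes
  status_code: 10,
  region: 10,
  user_id: 1_000_000,  // the high-cardinality trap
};

const series = Object.values(labelCardinality).reduce((a, b) => a * b, 1);
console.log(series); // 5000000000 — five billion time series
```

Drop user_id from the metric and the same label set collapses to 5,000 series; the per-user detail belongs on spans instead.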
Having too much data can be just as problematic as too little — it also creates false certainty.
OTel exemplars attach a trace ID to a metric data point. When you see a p99 latency spike on a dashboard, click through to the exact trace that caused it.
Connecting logs to traces, and how tracing crosses service boundaries.
OTel does not replace existing loggers — it provides a Log Bridge API to connect them to the OTel pipeline.
How does Service B know it's part of the same trace as Service A? The W3C traceparent header, with the format version-traceid-parentid-flags:
The OTel SDK automatically injects traceparent on outgoing calls and extracts it on incoming requests. If any service in the chain drops traceparent, the trace breaks. This parallels the challenges of sessioning with HTTP — HTTP's statelessness affects everything built on top of it.
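The header is small enough to parse by hand. A sketch, using the example value from the W3C Trace Context specification:

```javascript
// Parse a W3C traceparent header: version-traceId-parentSpanId-flags.
function parseTraceparent(header) {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null; // malformed header — the trace link is lost here
  const [, version, traceId, parentSpanId, flags] = m;
  return { version, traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const ctx = parseTraceparent(
  "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
);
console.log(ctx.traceId, ctx.sampled); // downstream spans reuse this traceId
```

Service B creates its spans with the received traceId and uses parentSpanId as the parent, which is how the cross-service tree stitches together.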
Two approaches to adding OTel to your application.
Patches popular libraries at load time — zero code changes to your route handlers:
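For Node, enabling this is typically a matter of preloading a register module before the app starts — a sketch, assuming `@opentelemetry/auto-instrumentations-node` is installed (the service name and endpoint are placeholders):

```shell
# Spec-defined environment variables, picked up by the SDK automatically.
export OTEL_SERVICE_NAME="my-web-app"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"

# Preload auto-instrumentation; app.js itself is unchanged.
node --require @opentelemetry/auto-instrumentations-node/register app.js
```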
Automatically creates spans for: HTTP requests (Express, Fastify), DB queries (pg, mysql2, redis), gRPC, DNS, and more.
Auto-instrumentation does not know your business logic. Add manual spans for domain operations:
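The pattern looks like this — a runnable sketch with a stub tracer standing in for `trace.getTracer(...)` from `@opentelemetry/api`, so only the span lifecycle is shown:

```javascript
// Stub tracer so the sketch runs without the OTel packages; real code
// would obtain a tracer from @opentelemetry/api instead.
function getStubTracer() {
  const finished = [];
  return {
    finished,
    startActiveSpan(name, fn) {
      const span = {
        name, attributes: {}, status: "UNSET",
        setAttribute(k, v) { this.attributes[k] = v; },
        setStatus(s) { this.status = s; },
        end() { finished.push(this); },
      };
      return fn(span);
    },
  };
}

const tracer = getStubTracer();

// A manual span for a domain operation auto-instrumentation can't see.
function processOrder(order) {
  return tracer.startActiveSpan("processOrder", (span) => {
    span.setAttribute("order.id", order.id);
    const total = order.items.reduce((sum, i) => sum + i.price, 0);
    span.setStatus("OK");
    span.end(); // always end the span, or it never exports
    return total;
  });
}

console.log(processOrder({ id: "o-1", items: [{ price: 5 }, { price: 7 }] })); // 12
```

Because the manual span is created inside the active context, it nests under the auto-created HTTP span in the trace tree.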
The pipeline between your apps and backends, and controlling data volume.
A vendor-agnostic proxy — arguably the most important OTel component:
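A minimal Collector pipeline, sketched in its YAML config (receiver, processor, and exporter names are standard Collector components; the backend endpoints are placeholders):

```yaml
# Receive OTLP, batch it, fan out traces and metrics to separate backends.
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:                  # buffer and batch to cut network overhead

exporters:
  otlp/traces:            # placeholder tracing backend
    endpoint: tracing-backend.example.com:4317
  prometheusremotewrite:  # placeholder metrics backend
    endpoint: https://metrics.example.com/api/v1/write

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/traces]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
```

Swapping vendors means editing the exporters section — the application's instrumentation never changes.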
| Strategy | Where | How It Works | Tradeoff |
|---|---|---|---|
| Head-based | SDK (at trace start) | Decide before trace begins — e.g., keep 10% randomly | Simple, but may discard errors |
| Tail-based | Collector (after trace) | Collect all spans, then decide based on content | Keeps errors/outliers, costs memory |
| Rule-based | SDK or Collector | Always keep errors, latency > 2s, specific endpoints | Best of both, more complex config |
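A rule-based decision can be sketched in a few lines (field names are hypothetical; real tail sampling runs in the Collector over complete traces, and real head sampling uses the SDK's TraceIdRatioBased sampler):

```javascript
// Keep errors and latency outliers; otherwise sample a fixed ratio.
function shouldSample(trace, ratio = 0.1) {
  if (trace.isError) return true;           // rule: never drop errors
  if (trace.durationMs > 2000) return true; // rule: keep slow outliers
  // Ratio fallback, derived deterministically from the trace ID so
  // every service in the chain makes the same keep/drop decision.
  const bucket = parseInt(trace.traceId.slice(0, 8), 16) / 0xffffffff;
  return bucket < ratio;
}

console.log(shouldSample({ isError: false, durationMs: 3000, traceId: "00".repeat(16) }));
```

The deterministic hash is the key detail: random per-service coin flips would produce traces with holes in them.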
Standardized naming, service identity, and end-to-end browser traces.
Standardized attribute names so telemetry works across tools and teams:
| Category | Attribute | Example |
|---|---|---|
| HTTP | http.request.method | GET |
| HTTP | http.response.status_code | 200 |
| Database | db.system | mysql |
| Database | db.statement | SELECT * FROM users WHERE id=? |
| Service | service.name | user-service |
| Service | service.version | 2.1.0 |
Without conventions, one library calls it database_type, another dbSystem — cross-service queries become impossible.
service.name is the most important attribute in OTel. If you set nothing else, set this. Without it, your service shows up as unknown_service in dashboards.
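Both service.name and service.version can be set without touching code, via the spec-defined SDK environment variables (the values below are placeholders):

```shell
# Picked up automatically by OTel SDKs in every language.
export OTEL_SERVICE_NAME="user-service"
export OTEL_RESOURCE_ATTRIBUTES="service.version=2.1.0,deployment.environment=production"
```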
@opentelemetry/sdk-trace-web enables end-to-end traces from user click through backend:
Cross-origin requests need CORS cooperation for traceparent injection: if the server does not list traceparent in Access-Control-Allow-Headers, the browser strips it and the trace breaks at the browser-server boundary.
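The fix is a one-line server change; a framework-agnostic sketch (the header names are standard CORS, the origin value is a placeholder):

```javascript
// CORS response headers that let the browser send traceparent cross-origin.
function corsHeaders(allowedOrigin) {
  return {
    "Access-Control-Allow-Origin": allowedOrigin, // placeholder origin
    "Access-Control-Allow-Methods": "GET, POST, OPTIONS",
    // Without "traceparent" here, the browser strips the header and the
    // trace breaks at the browser-server boundary.
    "Access-Control-Allow-Headers": "Content-Type, traceparent, tracestate",
  };
}

const headers = corsHeaders("https://app.example.com");
console.log(headers["Access-Control-Allow-Headers"]);
```

These headers are returned on the preflight OPTIONS response; only then will the browser attach traceparent to the real request.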
How OTel relates to course projects, real-world deployment, and the broader community.
We build custom analytics in CSE 135 — where does OTel fit?
| | Behavioral Analytics (Google Analytics) | Technical Analytics (OTel) |
|---|---|---|
| Purpose | What users do (pageviews, clicks, funnels) | How the system behaves (latency, errors) |
| Audience | Product managers, designers, marketers | Developers, SREs, DevOps |
| Questions | "Where do users drop off?" | "Why is this request slow?" |
| Data model | Events with user-centric attributes | Traces/spans with system-centric attributes |
| Collection | Client-side JS beacon | SDK auto-instrumentation + Collector |
| Standards | No universal standard (GA as de facto) | OTLP, W3C Trace Context |
db.statement may contain user data in WHERE clauses — redact it in the Collector before export.
OTel is governed by the CNCF with contributions from Google, Microsoft, Splunk, Datadog, and hundreds more.
19 sections of OpenTelemetry in one table.
| Concept | Key Takeaway |
|---|---|
| Observability | Understanding system internals from external outputs — goes beyond monitoring to answer "why" |
| Three pillars | Traces (request flow), Metrics (aggregated numbers), Logs (discrete events) — use all three |
| OpenTelemetry | Vendor-neutral standard for generating, collecting, and exporting telemetry |
| Architecture | App → OTel SDK → OTel Collector → Backend(s); instrument once, export anywhere |
| Context propagation | W3C traceparent carries trace/span IDs across service boundaries |
| Auto-instrumentation | Spans for HTTP, DB, RPC with zero code changes; manual spans for business logic |
| Collector | Vendor-agnostic pipeline (receivers → processors → exporters) |
| Sampling | Head-based (fast), tail-based (smart) — keep errors and outliers, sample the rest |
| Semantic conventions | Standardized attribute names enable cross-service querying |
| OTel vs. Behavioral | OTel = operational observability; Behavioral = product intelligence — complementary |
| Privacy | Telemetry can leak PII — use Collector processors to redact and filter |