Web Analytics Overview

Measuring the Web

Web analytics is the collection, measurement, and analysis of website or app usage data. Be careful not to trivialize this as counting hits or visitors. When done correctly, web analytics data is not just a marketing tool — it is operational intelligence that helps developers find errors and performance issues, helps designers improve aesthetic acceptance and usability, and, importantly, helps owners measure outcomes and ensure web efforts remain economically viable.

Analytics answers the questions you cannot answer just by looking at your own code: it addresses what happens when real users meet the executing code.

Analytics can answer these types of questions and more, but be careful not to just collect data and look for insights. We really need to determine our questions first and aim to prove/disprove our beliefs rather than hoping a magical insight emerges from the data.

1. Why Web Analytics?

Web analytics helps answer questions about where things break, what users actually do, and whether the design and content are working.

These questions serve three distinct purposes:

  1. Error tracking and elimination — Finding broken pages, failed requests, JavaScript exceptions, and resource load failures that users never report.
  2. User behavior monitoring — Understanding what users actually do: which features they use, which paths they take, where they abandon tasks.
  3. Usability and interest measurement — Determining whether the interface works, whether content resonates, and whether changes improve or degrade the experience.

These three purposes map directly to the three participant groups in any web application: developers care about errors, business stakeholders care about conversions and engagement, and users benefit from a better experience even though they never see the analytics data.

A single analytics system can serve all three groups — the same data answers different questions depending on who is asking.

2. The Analytics Stack

A basic analytics pipeline has three parts:

  1. Collection — Receives data from client-side scripts (beacons, events) or ingests server logs. This is the entry point for all analytics data.
  2. Storage — A database optimized for time-series queries: how many page views last week? What was the error rate yesterday? Typical choices include PostgreSQL, ClickHouse, or cloud data warehouses.
  3. Reporting — Dashboards and visualizations that transform raw data into actionable insight. This is where numbers become decisions.
┌──────────┐   HTTP    ┌────────────┐  INSERT   ┌──────────┐  SELECT   ┌───────────┐
│  Client  │──────────▶│ Collection │──────────▶│ Storage  │──────────▶│ Reporting │
│ (Browser)│   POST    │  (Server)  │           │(Database)│           │(Dashboard)│
└──────────┘           └────────────┘           └──────────┘           └───────────┘
 JS events,             Receives &               Time-series            Charts, tables,
 performance timing,    validates                queries                alerts, exports
 errors                 payloads

In this course, you build each part yourself. The collector receives HTTP POST requests containing analytics events, and we also collect data via HTTP access logs. This data is aggregated into the storage layer so that the reporting layer can query it and present it visually.
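As a concrete illustration of the collection step, here is a minimal collector sketch, assuming a Node.js runtime and the /collect endpoint used elsewhere in these notes; a real collector adds validation, sanitization, and a database INSERT.

// Minimal collector sketch (Node 'http'); storage is only a console.log placeholder
const http = require('http');

http.createServer((req, res) => {
    if (req.method === 'POST' && req.url === '/collect') {
        let body = '';
        req.on('data', chunk => { body += chunk; });
        req.on('end', () => {
            // A real pipeline would validate this payload and INSERT it into storage
            console.log(new Date().toISOString(), body);
            res.writeHead(204);
            res.end();
        });
    } else {
        res.writeHead(404);
        res.end();
    }
}).listen(8080);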

Beware Pipelines Vary: A real data pipeline may have more steps than presented. For example, we might argue that data sanitization should be distinctly separated from the collection step. Further, we might call out reporting as an activity that supports analysis, creation, and consumption. Conceptually, though, all the ideas are present in most pipelines even if not called out directly.

3. Data Collection Methods

There are three fundamental approaches to collecting analytics data, each capturing data at a different point in the client-server communication:

  1. Server logs — The web server automatically records every request: IP address, URL, status code, User-Agent, timestamp. In principle, any part of the HTTP request can be logged, including extra data provided via Client Hints. A motivating aspect of logging is that no client code changes are needed. While this is the oldest method, it is also the most reliable one, and complete analytics solutions leverage logs.
  2. Network capture (packet sniffing) — Intercepting traffic between client and server to inspect the full request/response. Requires network access but no server or client changes. Largely obsolete for content analysis since HTTPS, unless you are within a datacenter where you can install a certificate on the capture device.
  3. Client-side scripts — JavaScript code running in the browser or other native code in a mobile application that captures events (clicks, scrolls, errors, timing) and sends them to a collector. This is the most flexible method and collects a wealth of information, but it requires code deployment and depends on JavaScript being enabled.
┌──────────┐                      Network                      ┌──────────┐
│  Client  │ ────────────────────────────────────────────────▶│  Server  │
│ (Browser)│                                                   │          │
└──────────┘                                                   └──────────┘

 Client-side scripts capture:   Network capture captures:      Server logs capture:
 • DOM events                   • Request/response headers     • IP address
 • Scroll depth                 • Payload content (HTTP only,  • URL path
 • Click coordinates              not HTTPS content)           • Status code
 • JS errors                    • Connection metadata          • User-Agent
 • Performance timing                                          • Timestamp
 • Viewport size                                               • Response size
Aspect Server Logs Network Capture Client-Side Scripts
What it captures HTTP requests received by server All traffic on the wire Any browser event or state
Requires code changes? No — built into web servers No — passive observation Yes — must add JS to pages
Captures client events? No — only sees requests No — only sees network traffic Yes — clicks, scrolls, errors
Works with HTTPS? Yes — runs on the server Metadata only — content encrypted Yes — runs in the browser
Performance impact Minimal — logging is routine Variable — depends on volume Variable — adds JS payload
Privacy concerns Moderate — IP, paths High — deep packet inspection Very High — can capture anything
Example tools Apache/Nginx logs, GoAccess, AWStats Wireshark, tcpdump Google Analytics, custom beacons

Cross-references: Web Servers Overview covers server logging configuration; HTTP Overview covers the headers that appear in logs.

HTTPS makes network capture mostly obsolete for content analysis. You can see metadata (IP addresses, timing, connection info), but not request/response content unless the capture device also holds the site's TLS certificate and terminates and forwards the connection. Modern analytics relies on server logs and client-side scripts. Network capture remains useful for debugging connection-level issues, not for analytics.

4. What Can Be Collected

Analytics data comes from four sources, each providing different information automatically or with code:

From HTTP Headers (automatic)

  1. IP address — geographic location, ISP, rough identity
  2. User-Agent — browser, OS, device type
  3. Referer — where the user came from (previous page or search engine)
  4. Accept-Language — user's language preferences
  5. Cookies — session identifiers, tracking IDs, preferences
  6. Other Headers — if logs are extended, any header can be collected, including those set by script code or added because of Client Hints — see Section 5 (Enriching Server Logs) for details on log formats, Client Hints, and script-to-header techniques

From the URL (automatic)

From the Server (automatic)

From JavaScript (requires code)

Cross-references: URL Overview covers query parameters and UTM tracking codes; HTTP Overview covers request/response headers; State Management covers cookies and session tracking.
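The JavaScript category above requires code. A minimal page-view beacon might look like the sketch below; the field names are illustrative, and /collect matches the collector endpoint used later in these notes.

// Send a page-view event with client-side context to the collector
const pageView = {
    type: 'page_view',
    url: location.href,
    referrer: document.referrer,
    viewport: window.innerWidth + 'x' + window.innerHeight,
    language: navigator.language,
    timestamp: Date.now()
};
navigator.sendBeacon('/collect', JSON.stringify(pageView));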

Aim for a Complete Picture Client-side scripts provide the most detail (clicks, scrolls, errors), but many of the other basics, like browser type, request URL, referer URL, and more, are already in your access logs or could be. Given that the script could be disabled or not run by bots, it makes the most sense to capture both logs and script data and merge them. Sadly, implementation difficulty, organizational inertia, or privacy concerns may require you to choose one or the other.

5. Enriching Server Logs

Server access logs are the oldest and most reliable analytics data source — every request is recorded without any client-side code. But the default Common Log Format captures only 7 fields. With configuration changes and a few techniques, you can transform logs into a rich analytics dataset that rivals client-side collection for many use cases.

Log Formats — From Common to Custom

The Common Log Format (CLF) gives you IP, identity, user, timestamp, request line, status, and size — only 7 fields, with no browser info, no referrer, and no timing. The Combined format adds Referer and User-Agent, making it the de facto standard for analytics. But custom formats can capture any HTTP header, server variable, cookie, or timing metric.

Log Format Fields Analytics Value
Common (CLF) IP, identity, user, timestamp, request line, status, size Basic hit counting only — no browser or referrer data
Combined CLF + Referer + User-Agent Traffic sources, browser/OS breakdown — the analytics minimum
Custom / Extended Any header, cookie, variable, or timing metric Rich analytics: response time, language, viewport, device hints

Both Apache and Nginx support custom log formats that can capture any request header:

# Apache — LogFormat + CustomLog
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %D \"%{Accept-Language}i\" \"%{X-Viewport}i\"" enriched
CustomLog /var/log/apache2/access.log enriched

# %D = response time in microseconds
# %{HeaderName}i = any request header
# Nginx — log_format + access_log
log_format enriched '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent" '
                    '$request_time "$http_accept_language" "$http_x_viewport"';
access_log /var/log/nginx/access.log enriched;

# $request_time = response time in seconds
# $http_headername = any request header (lowercase, dashes become underscores)

The key insight: %{HeaderName}i (Apache) and $http_headername (Nginx) let you log any request header. This is what makes the techniques below possible.

Cross-reference: Web Servers Overview covers basic log configuration.

Client Hints — Structured Device Information

User-Agent strings are chaotic, bloated, and increasingly frozen by browsers. Client Hints are the modern replacement: structured, opt-in headers the server requests. The server sends an Accept-CH header (or <meta http-equiv="Accept-CH">) listing desired hints — the browser then includes them on subsequent requests, and the server logs them automatically.
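A minimal sketch of that handshake, assuming a Node.js server (the hint list and port are illustrative); in production you would typically set Accept-CH in your Apache or Nginx configuration instead.

// Ask Chromium browsers for Client Hints and read back whatever they send
const http = require('http');

http.createServer((req, res) => {
    // Hints arrive as ordinary request headers on subsequent requests
    console.log('platform:', req.headers['sec-ch-ua-platform'],
                'ect:', req.headers['ect'],
                'dpr:', req.headers['sec-ch-dpr']);

    // Advertise which hints we want the browser to include next time
    res.setHeader('Accept-CH', 'Sec-CH-UA-Platform, Sec-CH-DPR, ECT, Downlink, RTT');
    res.end('ok');
}).listen(8080);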

Client Hint                   What It Provides                Example Value
Sec-CH-UA                     Browser brand and version       "Chromium";v="124", "Chrome";v="124"
Sec-CH-UA-Mobile              Mobile device?                  ?0 (no) or ?1 (yes)
Sec-CH-UA-Platform            Operating system                "macOS"
Sec-CH-UA-Platform-Version    OS version                      "14.5"
Sec-CH-UA-Full-Version-List   Detailed browser versions       Full version strings
Sec-CH-UA-Model               Device model                    "Pixel 8"
Sec-CH-UA-Arch                CPU architecture                "arm"
Sec-CH-Viewport-Width         Viewport width in CSS pixels    1440
Sec-CH-DPR                    Device pixel ratio              2.0
ECT                           Effective connection type       4g, 3g, 2g, slow-2g
Downlink                      Estimated bandwidth (Mbps)      10.0
RTT                           Estimated round-trip time (ms)  50

Client Hints "upscale" your logs: structured device info, network quality, and viewport data — all from headers, logged automatically. However, they are Chromium-only (Chrome, Edge, Opera). Firefox and Safari don't support them, so User-Agent remains necessary as a fallback.

Client Hints are a philosophical shift: instead of the client announcing everything about itself (bloated UA), the server asks for what it needs. Better for privacy and better for analytics — structured fields vs regex-unfriendly strings.

Script-to-Header — Bridging Client and Server Logs

JavaScript knows viewport, scroll depth, errors, and color scheme. Server logs know timing, status codes, and upstream latency. Merging both is ideal. The technique: JavaScript sets a cookie with client-side data — the browser sends it on subsequent requests — the server log format captures it.

// Set client-side data as a cookie for server log capture
document.cookie = '_viewport=' + window.innerWidth + 'x' + window.innerHeight
    + ';path=/;SameSite=Lax;max-age=1800';

// Or set a custom header on fetch requests
fetch('/api/data', {
    headers: { 'X-Viewport': window.innerWidth + 'x' + window.innerHeight }
});

The server captures the cookie via %{_viewport}C (Apache) or $cookie__viewport (Nginx). The advantage: client data appears in server logs without a separate beacon endpoint. The limitation: cookie-based data is one request behind (the cookie is set on one request and sent on the next), and cookies add overhead to every request. This technique works best for data that changes infrequently: viewport, timezone, color scheme, device pixel ratio.

Log Forwarding — Third-Party Analysis of First-Party Data

A hybrid model: collect logs first-party, then forward them to a third party for analysis. This gives you the privacy benefits of first-party collection with the analysis power of third-party tools.

Web Server ──writes──▶ access.log ──▶ Log Shipper ──ships over HTTPS──▶ Analysis Service
                                      (Filebeat, Fluentd,               (ELK, Splunk,
                                       or Fluent Bit)                    Datadog, custom)

Common log shippers include rsyslog/syslog-ng (traditional), Filebeat/Fluentd/Fluent Bit (modern), cloud agents, or even a simple pipe-to-script. The privacy advantage is significant: the browser never talks to the third party. You decide what fields to forward and what to redact.

This approach blurs the first-party / third-party line — collection is first-party, analysis may be third-party. Cross-reference: the First-Party vs Third-Party section below discusses this middle ground.

Every additional header you log increases storage and may contain PII. Apply data minimization to logs: log what you need, not everything you can. Rotate and expire logs on a defined schedule, and redact or hash IP addresses if you do not need them for analysis.

6. First-Party vs. Third-Party Analytics

The distinction between first-party and third-party analytics is about who collects the data and where it goes.

The technical difference is straightforward: where does the analytical data you collect go? If sendData() goes to analytics.yoursite.com, it is first-party. If it goes to google-analytics.com, it is third-party.

Aspect First-Party Third-Party
Data ownership You own it completely Vendor stores and may use it
Cookie scope Your domain only Vendor's domain (cross-site capable)
Privacy compliance Simpler — you control data flows Complex — data leaves your control
Setup effort High — build and maintain infrastructure Low — add a script tag
Cross-site tracking Not possible (your domain only) Possible (vendor sees all their clients' sites)
Ad blocker impact Usually unblocked (same domain) Often blocked (known tracking domains)
Cost Infrastructure and development time Often perceptually "free" (you pay with data)

Cross-references: State Management covers first-party vs third-party cookies; Foundations Overview covers trust boundaries and data collection zones.

When a service is free, you or your users are the product. Third-party analytics vendors aggregate data across all their customers' sites, building cross-site user profiles. This is exactly what privacy regulations like GDPR target. First-party analytics avoids this entirely because the data never leaves your infrastructure.

7. Browser Fingerprinting & User Identification

A fundamental analytics challenge is identifying unique users across visits. Cookies are the traditional mechanism — set a unique ID in a cookie, and you recognize the user when they return. But cookies can be cleared, blocked by browsers, rejected by privacy-conscious users, or simply not persist in incognito mode. Fingerprinting is the alternative.

What Is Browser Fingerprinting?

Browser fingerprinting combines multiple browser and device attributes to create a quasi-unique identifier. No single attribute is unique — millions of people use Chrome on Windows — but the combination of attributes is surprisingly distinctive. It works without cookies, without login, and without any stored state on the client.

Attribute Source How It Helps
User-Agent string HTTP header Browser + OS + version
Screen resolution screen.width/height Device type
Installed fonts Canvas/JS enumeration Highly variable across systems
Canvas rendering Canvas API GPU/driver-specific pixel differences
WebGL renderer WebGL API GPU hardware identifier
Timezone Intl.DateTimeFormat Geographic signal
Language navigator.language Locale preference
Platform navigator.platform OS identifier
Hardware concurrency navigator.hardwareConcurrency CPU core count
Audio fingerprint AudioContext API Audio stack differences
Installed plugins navigator.plugins (legacy) Increasingly limited

How Fingerprint Hashing Works

The process is conceptually simple: collect attributes, concatenate them into a string, and hash the result to produce a fingerprint ID:

// Conceptual fingerprint generation (helper functions are simplified illustrations)
function getCanvasFingerprint() {
    // Render text to a canvas and read it back; GPU/driver differences yield distinct pixels
    const canvas = document.createElement('canvas');
    const ctx = canvas.getContext('2d');
    ctx.textBaseline = 'top';
    ctx.font = '14px Arial';
    ctx.fillText('fingerprint', 2, 2);
    return canvas.toDataURL();
}

function getWebGLRenderer() {
    // Query GPU info via WebGL (may be hidden or generic in some browsers)
    const gl = document.createElement('canvas').getContext('webgl');
    if (!gl) return 'no-webgl';
    const info = gl.getExtension('WEBGL_DEBUG_RENDERER_INFO');
    return info ? gl.getParameter(info.UNMASKED_RENDERER_WEBGL) : 'unknown';
}

async function generateFingerprint() {
    const components = [
        navigator.userAgent,
        screen.width + 'x' + screen.height,
        navigator.language,
        navigator.platform,
        navigator.hardwareConcurrency,
        Intl.DateTimeFormat().resolvedOptions().timeZone,
        getCanvasFingerprint(),   // render text to canvas, extract pixel data
        getWebGLRenderer()        // query GPU info via WebGL
    ];

    const raw = components.join('|');
    const hash = await crypto.subtle.digest('SHA-256',
        new TextEncoder().encode(raw));
    return Array.from(new Uint8Array(hash))
        .map(b => b.toString(16).padStart(2, '0')).join('');
}
// Result: something like "a3f2b8c1e9d4..." — fairly stable across visits

The resulting hash is fairly stable across visits from the same browser on the same device. It changes when the user updates their browser, installs new fonts, or changes OS settings — but it persists through cookie clearing and incognito mode.

Effectiveness and Limitations

Studies have shown that canvas rendering + WebGL + fonts can uniquely identify over 90% of desktop browsers. However, fingerprinting is not perfect: it is probabilistic rather than exact, the fingerprint changes when the browser or OS is updated, and it cannot follow a user across devices.

Cross-reference: State Management covers the cookie-based approach to user identification — fingerprinting is what works when cookies don't.

Fingerprinting vs. Cookies

Aspect Cookies Fingerprinting
User can clear Yes No
Blocked by ad blockers Sometimes Harder to block
Stability Until cleared Until browser/OS updates
Accuracy Exact (unique ID) Probabilistic (~90%+)
Privacy regulation Covered by GDPR Also covered by GDPR
Cross-device No (unless synced) No (different hardware)

Privacy and Ethics

GDPR considers fingerprinting personal data — consent is required just as it is for cookies. But unlike cookies, users cannot easily see, inspect, or clear a fingerprint. This asymmetry is ethically problematic. Browsers are actively fighting fingerprinting: Firefox offers "fingerprinting protection" that standardizes canvas output and hides hardware details, Safari's Intelligent Tracking Prevention (ITP) limits fingerprintable APIs, and Chrome has signaled similar intentions.

Fingerprinting works precisely because every browser is slightly different. The same features that let websites adapt to your device — screen size, GPU, fonts, timezone — can be combined to identify you. This is why privacy advocates and browser vendors are actively working to reduce fingerprintable surface area.
Just because you can fingerprint users doesn't mean you should. Unlike cookies, users cannot easily see or clear a fingerprint. GDPR treats fingerprints as personal data requiring consent. Use fingerprinting only when you have a legitimate, disclosed purpose — and prefer cookies or login-based identification when possible.

8. Collection Philosophy: Broad vs Targeted

There are two opposing philosophies for what data to collect:

Aspect Broad Collection Targeted Collection
Data volume Very high Low to moderate
Storage cost High and growing Predictable and manageable
Discovering unknowns Strong — data is already there Weak — must add new instrumentation
Privacy risk High — you may collect PII without realizing Low — you know exactly what you have
GDPR alignment Poor — violates data minimization Good — purpose-limited collection
Setup time Fast initially, hard to query later Slower initially, easy to query later

In practice, most teams start broad and narrow over time as they learn what matters. The key constraint is privacy law.

GDPR's data minimization principle says you should only collect data you have a specific purpose for. "We might need it someday" is not a purpose. Before collecting any user data, define why you need it, how long you will keep it, and who will access it. This applies whether you build first-party or use third-party analytics. I contend that information hoarding is a serious privacy risk, and that it is better to collect as little as possible until you know what you need. A breach of hoarded data could be very damaging to your business. Furthermore, too much data can create a form of data smog that may actually obfuscate rather than illuminate.

9. Privacy & Consent

Every analytics technique covered so far — server logs, client-side scripts, fingerprinting, session replay — collects data about real people. Privacy law exists precisely because analytics capabilities outpaced user expectations. If you build an analytics system (as you do in this course), you need to understand the legal landscape.

GDPR — The Global Standard

The General Data Protection Regulation (EU, 2018) applies to anyone processing data of EU residents, regardless of where the company is located. Six core principles are directly relevant to analytics:

  1. Lawfulness — You need a legal basis for processing: consent or legitimate interest.
  2. Purpose limitation — Collect for a stated purpose; don't repurpose data later.
  3. Data minimization — Collect only what is necessary for that purpose.
  4. Accuracy — Keep data correct and up to date.
  5. Storage limitation — Don't keep data longer than needed.
  6. Integrity & confidentiality — Protect the data you hold.

Key individual rights include: the right to access (see what data you hold), the right to erasure ("right to be forgotten"), and the right to data portability. Enforcement is serious: fines up to 4% of global annual revenue or €20M, whichever is greater.

ePrivacy Directive & Cookie Consent

The ePrivacy Directive is separate from GDPR and specifically governs electronic communications, including cookies. It is the origin of cookie banners and consent pop-ups. The core rule: you must obtain informed consent before setting non-essential cookies. Analytics cookies are non-essential. Only "strictly necessary" cookies (session management, security tokens) are exempt.

Beyond Europe

Privacy regulation is expanding globally, not contracting. The California Consumer Privacy Act (CCPA) gives California residents rights similar to GDPR's, and many other jurisdictions have since passed their own privacy laws.

Consent Mechanisms

What This Means for Your Analytics

What You're Doing Consent Required? Why
Server access logs (IP, path, UA) Usually not Standard operational logging qualifies as legitimate interest
First-party analytics cookies Yes (ePrivacy) Non-essential cookie — requires informed consent
Fingerprinting Yes (GDPR) Creates personal data through processing
Session replay Yes (GDPR) Records user behavior and may capture PII
Third-party analytics (e.g. GA) Yes (GDPR + ePrivacy) Data leaves your control; third-party cookies
Aggregate, cookie-free metrics only Often not No personal data processed — no individual tracking
Privacy regulation doesn't prohibit analytics — it requires transparency and consent. The simplest path to compliance: use first-party, cookie-free analytics that collects only aggregate data. Tools like Plausible and Umami demonstrate that useful analytics doesn't require tracking individual users.
Ignoring privacy law is not a viable strategy. GDPR fines are real — Google was fined €150M, Amazon €746M. Even if you are a student building a course project, understanding consent requirements now will save you from costly mistakes in production systems later.
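One common consent mechanism is to gate the analytics script on an explicit opt-in. A minimal sketch, assuming a hypothetical consent flag in localStorage and a hypothetical first-party script path:

// Load analytics only after the user has granted consent
function loadAnalyticsIfConsented() {
    if (localStorage.getItem('analytics-consent') !== 'granted') return;
    const s = document.createElement('script');
    s.src = '/js/analytics.js';   // hypothetical first-party analytics script
    s.defer = true;
    document.head.appendChild(s);
}

// Call on page load and again from the consent banner's "accept" handler
loadAnalyticsIfConsented();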

10. Error Tracking and Elimination

The most immediate value of analytics for developers is finding errors that users never report. Most users who encounter a broken page simply leave — they do not file a bug report. Interestingly, the marketing of all things means we have adopted a specialized term for using analytics to see how software is behaving: observability.

Server-Side Error Detection

Server access logs contain HTTP status codes for every request. Filter for 4xx (client errors: broken links, missing resources) and 5xx (server errors: crashes, timeouts). This requires no code changes — just log analysis.
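A minimal sketch of this kind of log analysis, assuming Node.js and a Combined-format log at a hypothetical path; real deployments would more likely use a log analyzer such as GoAccess.

// Count 4xx/5xx responses in an access log
const fs = require('fs');

const counts = {};
for (const line of fs.readFileSync('/var/log/nginx/access.log', 'utf8').split('\n')) {
    // Combined format: ... "GET /path HTTP/1.1" 404 512 ...
    const match = line.match(/" (\d{3}) /);
    if (match && Number(match[1]) >= 400) {
        counts[match[1]] = (counts[match[1]] || 0) + 1;
    }
}
console.log(counts);   // e.g. { '404': 87, '500': 3 }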

Client-Side Error Detection

JavaScript errors, failed resource loads, and unhandled promise rejections are invisible to the server unless you explicitly capture and report them:

// Capture JavaScript errors
window.addEventListener('error', function(event) {
    const errorData = {
        type: 'js_error',
        message: event.message,
        source: event.filename,
        line: event.lineno,
        column: event.colno,
        timestamp: Date.now(),
        url: location.href,
        userAgent: navigator.userAgent
    };

    // sendBeacon guarantees delivery even during page unload
    navigator.sendBeacon('/collect', JSON.stringify(errorData));
});

// Capture unhandled promise rejections
window.addEventListener('unhandledrejection', function(event) {
    navigator.sendBeacon('/collect', JSON.stringify({
        type: 'promise_rejection',
        reason: String(event.reason),
        timestamp: Date.now(),
        url: location.href
    }));
});

Cross-references: HTTP Overview covers status codes; Web Servers Overview covers access logs and error logs.

navigator.sendBeacon() was designed for analytics — it is built to deliver data reliably even when the page is unloading. Unlike XMLHttpRequest or fetch(), a beacon will not be cancelled when the user navigates away or closes the tab. This makes it essential for capturing exit events and errors. However, it is not always appropriate, and sometimes you'll see very old-style approaches, such as JavaScript-enhanced image (pixel) beacons for robust collection or the use of sockets for event streaming.

11. Performance Analytics & Web Vitals

A page that works perfectly but takes 8 seconds to load is a failed page. Performance is a core analytics concern — it affects user experience, conversion rates, SEO rankings, and revenue. Google uses performance metrics as ranking signals. Measuring real user performance is a direct application of analytics.

Core Web Vitals

Core Web Vitals are Google's framework for measuring user experience through three metrics:

Metric What It Measures Good Threshold API
LCP (Largest Contentful Paint) Loading — when the main content is visible ≤ 2.5s PerformanceObserver
INP (Interaction to Next Paint) Responsiveness — delay from user input to visual update ≤ 200ms PerformanceObserver
CLS (Cumulative Layout Shift) Visual stability — unexpected layout movement ≤ 0.1 PerformanceObserver

These are measured on real users, not lab tests. That is the analytics connection — you need to collect these metrics from actual visitor sessions. Google uses Core Web Vitals as search ranking signals, meaning poor performance literally reduces your search visibility.

RUM vs. Synthetic Monitoring

Aspect Real User Monitoring (RUM) Synthetic Monitoring
Data source Actual user sessions Scripted bots from controlled locations
What it tells you Real-world experience across devices/networks Baseline performance under controlled conditions
Variability High — real networks, real devices Low — consistent test environment
Coverage Only pages users visit Any page you configure
Setup Analytics script on every page Configure test scenarios
Examples web-vitals JS library, custom beacons Lighthouse, WebPageTest, Pingdom

RUM is analytics. Synthetic is testing. Both are necessary but serve different purposes. For the course project, you are building RUM — collecting real performance data from real visitors.

The Performance APIs

Key browser APIs for collecting performance data:

// Using the web-vitals library (Google's official library)
import {onLCP, onINP, onCLS} from 'web-vitals';

function sendMetric(metric) {
    navigator.sendBeacon('/collect', JSON.stringify({
        type: 'web_vital',
        name: metric.name,      // 'LCP', 'INP', or 'CLS'
        value: metric.value,
        rating: metric.rating,  // 'good', 'needs-improvement', or 'poor'
        url: location.href,
        timestamp: Date.now()
    }));
}

onLCP(sendMetric);
onINP(sendMetric);
onCLS(sendMetric);
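If you prefer not to use the library, the underlying PerformanceObserver API can be used directly. A minimal sketch for LCP only; the web-vitals library adds important handling for INP, CLS, and page lifecycle that this omits.

// Observe Largest Contentful Paint entries directly
const lcpObserver = new PerformanceObserver((list) => {
    const entries = list.getEntries();
    const latest = entries[entries.length - 1];   // the most recent LCP candidate
    navigator.sendBeacon('/collect', JSON.stringify({
        type: 'web_vital',
        name: 'LCP',
        value: latest.startTime,
        url: location.href,
        timestamp: Date.now()
    }));
});
lcpObserver.observe({ type: 'largest-contentful-paint', buffered: true });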

What Performance Data Tells You

Performance analytics is where the developer, business, and user perspectives converge. Developers see slow queries and render-blocking resources. Business sees lost conversions. Users see a sluggish experience. A single LCP measurement captures all three concerns in one number.

Cross-reference: The Enriching Server Logs section (Section 5) covers how Client Hints like ECT, Downlink, and RTT can provide network quality data without JavaScript.

12. User Behavior and Usability

Simple error tracking is just the start; analytics can reveal how users actually interact with your site. The key behavioral metrics are page views, engagement, actions, and conversions.

These metrics form a funnel — each stage filters out users who do not proceed:

┌─────────────────────────────────────────────────┐
│ Page Views (100%)                               │
│ All visitors who arrive                         │
└──────────────────────┬──────────────────────────┘
                       │ ~60% continue
┌──────────────────────▼──────────────────────────┐
│ Engagement (60%)                                │
│ Scroll, click, spend >10 seconds                │
└──────────────────────┬──────────────────────────┘
                       │ ~25% continue
┌──────────────────────▼──────────────────────────┐
│ Action (15%)                                    │
│ Add to cart, start form, watch video            │
└──────────────────────┬──────────────────────────┘
                       │ ~30% continue
┌──────────────────────▼──────────────────────────┐
│ Conversion (5%)                                 │
│ Purchase, submit, sign up                       │
└─────────────────────────────────────────────────┘
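A quick sketch of how funnel stages are typically compared, using made-up counts; the useful number is each stage's conversion relative to the previous one.

// Stage-to-stage conversion through a funnel (counts are illustrative)
const funnel = [
    { stage: 'Page Views', count: 10000 },
    { stage: 'Engagement', count: 6000 },
    { stage: 'Action',     count: 1500 },
    { stage: 'Conversion', count: 500 }
];

funnel.forEach((step, i) => {
    const prev = i === 0 ? step.count : funnel[i - 1].count;
    console.log(step.stage + ': ' + ((step.count / prev) * 100).toFixed(1) + '% of previous stage');
});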

Understanding usability through analytics means looking for signals of confusion: high bounce rates on landing pages, form abandonment halfway through (especially at a sticky field), features with zero clicks despite prominent placement, users repeatedly hitting the back button, or excessive clicks on an unclickable object (dubbed a rage click). Many more signals can point us toward user problems. However, we need to be careful, as sometimes we can infer the wrong thing from a behavior. For example, many supposed dead clicks on a page could be anchor or scrolling touches, and users may also be highlighting content for copy-paste. A full session replay can capture this nuance, but aggregating all the details is hard, and watching replays is time-consuming.

The biggest usability insights often come from what users do NOT do. 1000 page views but 0 clicks on the call-to-action (CTA) tells you the CTA is invisible, confusing, or irrelevant. High traffic to a help page suggests the main interface is unclear. Analytics turns the absence of action into useful data.

13. Session Replay

Session replay takes analytics from numbers to narrative. Instead of knowing that 40% of users abandon a form, you can watch exactly what confused them.

How It Works

Session replay does not record video of the user's screen. Instead, it captures a stream of DOM mutations, mouse movements, scroll positions, clicks, and input events. On playback, it reconstructs the DOM state at each point in time, creating a faithful recreation of what the user saw and did.

The data format is a JSON stream of events, typically 100–500KB per session depending on page complexity and session length. This is far smaller than screen capture video would be. Though some folks will make movies from this data anyway.
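As an illustration, the open-source rrweb library (listed in the example tools below) exposes a record function that emits these DOM events. A minimal sketch, batching events to the /collect endpoint used in this course:

// Capture DOM mutations and interactions as JSON events, then ship them in batches
import { record } from 'rrweb';

let events = [];
record({
    emit(event) {
        events.push(event);
    },
    maskAllInputs: true   // mask form input values by default
});

// Flush the buffered events every 10 seconds
setInterval(() => {
    if (events.length === 0) return;
    navigator.sendBeacon('/collect', JSON.stringify({ type: 'replay', events }));
    events = [];
}, 10000);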

Aspect Detail
How it works Captures DOM mutations, mouse movement, scroll, clicks, inputs as JSON events
Data format JSON event stream, ~100–500KB per session
Sensitive data Must mask passwords, PII, payment fields — best tools mask by default
Privacy Requires explicit user consent under GDPR; must be disclosed in privacy policy
Value Qualitative insight — see exactly what confused users experienced
Example tools rrweb (open source), FullStory, Hotjar, LogRocket

The qualitative leap session replay provides is significant: instead of inferring user intent from aggregate numbers, you see exactly what happened. A user hovering over the wrong button, scrolling past the CTA, or rage-clicking a non-interactive element tells a story that no metric can easily capture. Usability lab tools can go further and capture video of the user, correlating facial tells and verbalizations (yes, this might include cursing) with the click stream to provide observational insight beyond replay alone. This approach is not realistic outside the lab, but the few times I have seen it used, it produced more interesting results than just replay followed by user interviews.

Session replay is powerful but privacy-sensitive. Always mask form inputs that might contain PII — names, emails, passwords, credit card numbers. The best replay tools mask inputs by default and require you to explicitly opt in to capturing sensitive fields. Never record payment forms or login credentials.

14. Observability and OpenTelemetry

Web analytics focuses on user behavior. Observability is the parallel discipline focused on system behavior — understanding what is happening inside your servers, databases, and services from their external outputs.

The Three Pillars

                    ┌───────────────┐
                    │ Observability │
                    └───────┬───────┘
         ┌──────────────────┼──────────────────┐
         ▼                  ▼                  ▼
   ┌──────────┐       ┌──────────┐       ┌──────────┐
   │   Logs   │       │ Metrics  │       │  Traces  │
   │          │       │          │       │          │
   │ What     │       │ How is   │       │ Where    │
   │ happened │       │ it doing │       │ did the  │
   │          │       │          │       │ request  │
   │ Discrete │       │ Numeric  │       │ go       │
   │ events   │       │ time-    │       │          │
   │ with     │       │ series   │       │ End-to-  │
   │ context  │       │ data     │       │ end path │
   └──────────┘       └──────────┘       └──────────┘

Analytics is observability for user behavior. Observability is analytics for software and system behavior. The two disciplines share tools, techniques, and infrastructure. A slow page load might be a user experience problem (analytics) caused by a slow query, which in turn may be caused by code or the database (observability).

OpenTelemetry (OTel)

OpenTelemetry is a vendor-neutral open standard for telemetry data — traces, metrics, and logs. It provides APIs and SDKs for most languages, so you can instrument your code once and send the data to any backend: Jaeger, Prometheus, Grafana, Datadog, or your own storage.

The value of a standard is interoperability. Without OTel, switching from one monitoring vendor to another means re-instrumenting your entire codebase. With OTel, you change a configuration file.
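A minimal sketch of what instrumentation looks like with the OpenTelemetry JavaScript API; the names here are illustrative, and it assumes an OTel SDK and exporter are configured elsewhere in the application.

// Wrap a unit of work in a span so it appears in the end-to-end trace
const { trace } = require('@opentelemetry/api');

const tracer = trace.getTracer('analytics-collector');

function storeEvent(payload) {
    return tracer.startActiveSpan('store-event', (span) => {
        span.setAttribute('event.type', payload.type);
        // ... validate the payload and insert it into storage ...
        span.end();
    });
}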

Cross-reference: Web Servers Overview covers server logging and monitoring.

OpenTelemetry is to observability what HTTP is to the web: a shared protocol that lets different tools work together. Instrument your application once with OTel, and send data to Jaeger for traces, Prometheus for metrics, and Loki for logs — all without changing application code.
Prefer protocols over platforms

A common suggestion I make is for developers to prefer adopting a protocol or technical standard rather than buying into a platform. Platforms can change, sometimes for the worse, but protocols persist. To follow this, think Git over GitHub; HTTP, HTML, CSS, and JavaScript over some particular framework; OTel over a particular analytics system.

15. Bot Traffic

Not all HTTP requests come from humans. A significant percentage of web traffic is automated — bots, crawlers, scrapers, and scripts. For analytics, this is a fundamental data quality problem: if you can't distinguish humans from bots, your metrics are meaningless.

Types of Bots

Bot Type Purpose Example Analytics Impact
Search engine crawlers Index content for search Googlebot, Bingbot Inflate page views
SEO/monitoring bots Check uptime, rankings, links Ahrefs, Screaming Frog Inflate page views
Social media bots Generate previews/cards Twitterbot, facebookexternalhit Inflate page views, distort referrers
RSS/feed readers Pull content updates Feedly, Inoreader Inflate page views
AI training crawlers Scrape content for LLM training GPTBot, ClaudeBot, Bytespider High volume, distort all metrics
Scraper bots Extract data, prices, content Custom scripts Inflate views, may stress server
Spam bots Submit forms, post comments Various Corrupt form/conversion data
DDoS / attack bots Overwhelm server Botnets Massive metric distortion
Click fraud bots Fake ad clicks Botnets Inflate click/conversion metrics

Scale of the Problem

Industry estimates put automated traffic at 30–50% of all web requests. For a new site with little organic traffic, the bot percentage can be much higher — sometimes the majority of your "visitors" are bots. Students building course projects will see this firsthand: your analytics data will contain bot visits, and you need to account for them.

Good Bots vs. Bad Bots

Bot Detection Methods

  1. User-Agent filtering — Most good bots identify themselves ("Googlebot", "Bingbot"). But User-Agent is trivially spoofable — bad bots often claim to be Chrome.
  2. robots.txt — A convention, not enforcement. Good bots respect it; bad bots ignore it. Not a detection method per se, but a filtering signal.
  3. Rate limiting — Humans don't request 100 pages per second. Abnormal request rates signal automation.
  4. Behavioral analysis — Bots don't scroll, don't move the mouse, don't hesitate before clicking. Client-side analytics can detect the absence of human behavior patterns.
  5. JavaScript execution — Many simple bots don't execute JavaScript. If your client-side beacon never fires but the server log shows a request, it's likely a bot.
  6. CAPTCHAs — Force proof of humanity. Effective but degrades UX. Use sparingly.
  7. IP reputation lists — Known bot networks and data center IPs. Services like Project Honeypot maintain lists.
  8. Honeypot fields — Hidden form fields that humans never fill out but bots do.
Request arrives
   │
   ├── Check IP reputation ──▶ Known bot IP? ──────────▶ Flag/block
   │
   ├── Check User-Agent ─────▶ Known bot UA? ──────────▶ Flag (good bot) or block (bad bot)
   │
   ├── Check rate ───────────▶ Abnormal frequency? ────▶ Rate limit / CAPTCHA
   │
   ├── Check JS execution ───▶ Beacon fired? ──────────▶ No = likely bot
   │
   └── Check behavior ───────▶ Mouse movement? Scroll? ▶ No = likely bot

Impact on Analytics

If you don't filter bots, your analytics data is compromised in multiple ways:

For the course project: you must consider bot traffic when interpreting your analytics data. If your "busiest page" has hundreds of views but zero scroll events, those are not real users.

The simplest bot filter: compare your server logs to your client-side beacon data. If a request appears in the server log but no corresponding beacon was received, the visitor likely didn't execute JavaScript — which means it's probably a bot. This is one of the strongest arguments for collecting both server logs and client-side data.
Don't assume your analytics data only contains human visitors. Industry estimates put bot traffic at 30–50% of all web requests. For a low-traffic course project site, the percentage may be even higher. Always sanity-check your data: if your "busiest page" has 500 views but 0 scroll events, those aren't real users.
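A sketch of that log-versus-beacon first-pass filter, assuming hypothetical data shapes: a list of parsed log hits and a set of keys built from received beacon events.

// Flag log entries that have no matching client-side beacon
// logHits:    [{ ip, path, time }]   parsed from the access log
// beaconKeys: Set of "ip|path" strings built from received /collect events
function flagLikelyBots(logHits, beaconKeys) {
    return logHits.filter(hit => !beaconKeys.has(hit.ip + '|' + hit.path));
}

// Usage: anything returned requested a page but never ran the analytics script
// const suspects = flagLikelyBots(parseAccessLog('access.log'), loadBeaconKeys());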

16. Data Quality Challenges

No analytics system gives you perfectly accurate data. Every collection method has blind spots, and multiple factors corrupt or reduce the data you receive:

Threat Impact Mitigation
Bots and crawlers Inflate page views, distort behavior metrics Filter by User-Agent, use CAPTCHAs, analyze behavior patterns — see Section 15 for detailed bot detection strategies
Ad blockers Block third-party analytics scripts entirely Use first-party collection (same domain), server-side logging
Cookie clearing Breaks session continuity — returning users appear as new Accept approximation; use server-side session tracking
Browser caching Cached pages generate no server request Client-side scripts still fire; combine methods
CDN caching CDN-served pages never reach your server logs CDN analytics APIs; client-side collection
Device switching Same user on phone and laptop appears as two users Login-based identity; accept approximation
VPN / Proxy IP-based geolocation becomes inaccurate Use Accept-Language, timezone from JS for location hints
JavaScript disabled Client-side analytics fails completely Server logs as fallback; <noscript> pixel tracking
Incognito / private mode No persistent cookies; every visit looks new Accept approximation; focus on session-level data

The fundamental challenge is the client-server gap: the JavaScript analytics code may never load (blocked, disabled, slow connection), so client-side analytics always undercounts compared to server logs. But server logs miss cached pages. Neither method captures everything.

State management directly affects data quality: if cookies are cleared, sessions break. If incognito mode is used, there is no persistence between visits. If third-party cookies are blocked (as they increasingly are), cross-domain tracking fails.

Cross-references: State Management covers cookies and sessions; Foundations Overview covers the client-server model and trust boundaries.

No analytics system gives you ground truth. Server logs miss cached pages. Client-side scripts miss ad-blocked users. Network capture misses HTTPS content. Every method has blind spots. The pragmatic approach: combine methods, cross-validate, and accept that analytics data is always an approximation — useful but not exact. However, be very careful not to abandon rigor because of this. Far too often, I have seen arguments about data imperfections used to excuse poor execution. If we are going to adopt a statistical mindset, we should aim for data that is trustworthy enough to act on.

17. Summary

Concept Key Takeaway
Why analytics Find errors users never report, understand behavior, measure what works
Analytics stack Collector → Storage → Reporting — you build all three in this course
Collection methods Server logs (automatic), network capture (obsolete for HTTPS), client-side scripts (flexible)
What to collect Headers and URLs automatically; JS adds clicks, scrolls, errors, timing
Enriching server logs Extend log formats with custom headers, Client Hints, and script-set cookies to upscale logs from basic hit counts to rich analytics data
1st vs 3rd party First-party = you own the data; third-party = vendor owns it and builds cross-site profiles
Browser fingerprinting Combines browser/device attributes to identify users without cookies — effective but ethically fraught and regulated by GDPR
Broad vs targeted Collect what you need, not what you might need — GDPR requires purpose
Privacy & consent GDPR, ePrivacy, and CCPA govern analytics collection — consent is required for cookies and fingerprinting; privacy-preserving tools avoid the problem entirely
Error tracking Catch silent failures with window.onerror and sendBeacon()
Performance / Web Vitals LCP, INP, CLS measure real user experience. Collect via Performance Observer API and sendBeacon. Google uses these as ranking signals.
User behavior Funnel analysis reveals where users drop off; absence of action is data
Session replay DOM-reconstruction replay, not video — powerful but privacy-sensitive
Observability Logs + Metrics + Traces; OpenTelemetry is the vendor-neutral standard
Bot traffic 30–50% of web traffic is automated — filter bots or your metrics are meaningless; compare server logs to client-side beacons as a first-pass filter
Data quality Every method has blind spots — combine methods and accept approximation

Back to Home | Data Visualization | Collector Demo | Collector Service | Foundations Overview