Web Analytics: Review Questions

Self-Study & Discussion

These questions cover the 17 sections of the Web Analytics overview and are organized into 5 topic clusters. No answers are provided — the goal is to test your understanding of analytics collection methods, privacy requirements, performance measurement, and data quality challenges.

The questions mix conceptual understanding, technical reasoning, and privacy/ethical trade-offs.

Cluster 1: Foundations — Why Analytics, The Stack & Collection Methods (Sections 1–3)

  1. The overview identifies three distinct purposes for analytics: error tracking, user behavior monitoring, and usability measurement. Map each purpose to one of the three participant groups (developers, business, users). Why does the overview say "the same data answers different questions depending on who is asking"?
  2. Describe the three parts of the analytics pipeline (Collection, Storage, Reporting). In the course project, you build all three — explain what each part does and what technology choices are involved at each stage. (A minimal sketch of the Collection and Storage stages follows this list.)
  3. Compare the three data collection methods (server logs, network capture, client-side scripts) across four dimensions: what they capture, whether they require code changes, whether they work with HTTPS, and their privacy implications. Why does the overview say network capture is "mostly obsolete"?
  4. Server logs require no code changes and capture every request automatically. Client-side scripts require JavaScript deployment but capture clicks, scrolls, and errors. Why does the overview recommend combining both methods rather than choosing one? What does each method miss that the other captures?
  5. The overview warns against "just collecting data and looking for insights" and instead advocates determining questions first. Explain this distinction. Why is hypothesis-driven analytics more reliable than exploratory data mining?
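
For question 2, here is a minimal sketch of the Collection and Storage stages, assuming a plain Node.js endpoint; the endpoint path, file name, and stack are illustrative, not the course project's actual choices. It accepts a beacon POST and appends it as one JSON line per event; the Reporting stage would later read and aggregate that file.

```javascript
// Collection + Storage sketch (assumed Node.js; '/collect' and 'events.ndjson'
// are illustrative names, not the course project's actual choices).
const http = require('http');
const fs = require('fs');

http.createServer((req, res) => {
  if (req.method === 'POST' && req.url === '/collect') {
    let body = '';
    req.on('data', chunk => { body += chunk; });            // Collection: receive the beacon
    req.on('end', () => {
      fs.appendFile('events.ndjson', body + '\n', err => {  // Storage: one JSON event per line
        res.writeHead(err ? 500 : 204);
        res.end();
      });
    });
  } else {
    res.writeHead(404);
    res.end();
  }
}).listen(8080);
```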

Cluster 2: Data Collection — What to Collect & Enriching Logs (Sections 4–5)

  1. The overview categorizes collectible data into four sources: HTTP headers, URLs, server data, and JavaScript. For each source, give two specific examples and explain whether collection is automatic or requires code.
  2. Compare the Common Log Format (CLF), the Combined format, and Custom/Extended log formats. What does the Combined format add over CLF, and why is it "the analytics minimum"? What can Custom formats capture that Combined cannot?
  3. Explain how Client Hints work: what does the server send to request them, what does the browser send back, and how do they improve on User-Agent strings? Why are they described as "a philosophical shift"? What is their major limitation? (See the first sketch after this list.)
  4. Describe the "script-to-header" technique for bridging client-side and server-side data. How does JavaScript set a cookie that the server then logs? What is the "one request behind" limitation, and why does this technique work best for data that changes infrequently? (See the second sketch after this list.)
  5. The overview describes log forwarding as a hybrid model that blurs the first-party / third-party line. Explain: how is collection first-party but analysis potentially third-party? What privacy advantage does this have over traditional third-party analytics?
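
For question 3, a sketch of the Client Hints mechanism. The server asks for high-entropy hints by sending an Accept-CH response header (for example, Accept-CH: Sec-CH-UA-Platform-Version, Sec-CH-UA-Model); the browser then attaches the corresponding Sec-CH-UA-* headers to later requests, and the same values are exposed to scripts through navigator.userAgentData:

```javascript
// Client Hints on the client side: low-entropy hints (brands, mobile, platform)
// are available immediately; high-entropy values must be requested explicitly.
// The logged values are examples, not data from the overview.
if (navigator.userAgentData) {
  console.log(navigator.userAgentData.brands);     // e.g. [{ brand: 'Chromium', version: '124' }, ...]
  console.log(navigator.userAgentData.platform);   // e.g. 'Android'

  navigator.userAgentData
    .getHighEntropyValues(['platformVersion', 'model'])
    .then(hints => console.log(hints));            // e.g. { platformVersion: '14.0.0', model: 'Pixel 8' }
}
```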
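
For question 4, a minimal script-to-header sketch; the cookie name and fields are illustrative, and the server is assumed to use a custom log format that records this cookie. JavaScript writes client-only data into a first-party cookie, the browser attaches it to the next request, and the value therefore shows up in the server log one request behind.

```javascript
// Script-to-header sketch: stash client-only data (viewport, timezone) in a
// first-party cookie so a custom server log format can record it on the
// NEXT request. The cookie name 'ss' and its fields are illustrative.
const clientData = [
  'vp=' + window.innerWidth + 'x' + window.innerHeight,
  'tz=' + Intl.DateTimeFormat().resolvedOptions().timeZone,
].join('&');

document.cookie = 'ss=' + encodeURIComponent(clientData) +
                  '; path=/; max-age=1800; SameSite=Lax';
```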

Cluster 3: Identity, Ownership & Privacy (Sections 6–9)

  1. Compare first-party and third-party analytics across data ownership, cookie scope, ad blocker impact, and cost. The overview says third-party analytics is "perceptually free" — explain what you actually pay with and why this matters under GDPR.
  2. Explain how browser fingerprinting works: what attributes are combined, how is the hash generated, and why is it "probabilistic, not exact"? Why does incognito mode NOT defeat fingerprinting? Compare fingerprinting to cookies across stability, accuracy, and user control. (A toy sketch follows this list.)
  3. The overview presents two collection philosophies: broad ("collect everything") and targeted ("collect specific things"). Compare them across data volume, privacy risk, and GDPR alignment. Why does the overview argue that information hoarding is a "serious privacy risk"?
  4. List the six GDPR principles relevant to analytics. Explain data minimization and purpose limitation with specific analytics examples. Why does the overview say "'we might need it someday' is not a purpose"?
  5. The overview lists six analytics activities and whether each requires consent: server logs, first-party cookies, fingerprinting, session replay, third-party analytics, and aggregate cookie-free metrics. Explain the legal reasoning behind each. Why can aggregate, cookie-free analytics often avoid consent requirements entirely?
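
For question 2, a toy fingerprinting sketch; real libraries combine many more signals (canvas rendering, fonts, audio, WebGL), so this is illustrative only. A handful of browser attributes are concatenated and hashed into a single identifier:

```javascript
// Toy fingerprint: concatenate a few browser attributes and hash them.
// Requires a secure context (HTTPS) for crypto.subtle.
async function roughFingerprint() {
  const signals = [
    navigator.userAgent,
    navigator.language,
    navigator.hardwareConcurrency,
    screen.width + 'x' + screen.height + 'x' + screen.colorDepth,
    Intl.DateTimeFormat().resolvedOptions().timeZone,
  ].join('|');

  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(signals));
  return Array.from(new Uint8Array(digest))
    .map(b => b.toString(16).padStart(2, '0'))
    .join('');
}
```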

Cluster 4: Developer Analytics — Errors, Performance & Behavior (Sections 10–13)

  1. Explain how navigator.sendBeacon() differs from fetch() or XMLHttpRequest for analytics. Why is it "essential for capturing exit events and errors"? What happens to a fetch() call when the user navigates away? (See the first sketch after this list.)
  2. Describe the three Core Web Vitals (LCP, INP, CLS): what each measures, its "good" threshold, and why Google uses them as search ranking signals. Why are these measured on real users (RUM) rather than lab tests? (See the second sketch after this list.)
  3. Compare Real User Monitoring (RUM) and Synthetic Monitoring across data source, variability, coverage, and use case. The overview says "RUM is analytics, synthetic is testing" — explain what this means. Which one are you building in the course project?
  4. Explain the conversion funnel concept: Page Views → Engagement → Action → Conversion. What analytics data do you need to measure each stage? The overview says "the biggest usability insights often come from what users do NOT do" — give two examples.
  5. How does session replay work technically? Why is it "DOM reconstruction, not video"? What are its privacy implications, and what data must always be masked? Compare the qualitative insight from replay to the quantitative insight from aggregate metrics.
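
For question 1, a sketch of exit-event reporting; the /collect endpoint is hypothetical. The payload is handed off to the browser, which can still deliver it after the page unloads:

```javascript
// Exit beacon sketch ('/collect' is a hypothetical endpoint).
// A fetch() fired here can be cancelled when the page unloads;
// sendBeacon() queues the payload for the browser to send afterwards.
addEventListener('pagehide', () => {
  const payload = JSON.stringify({
    event: 'page_exit',
    page: location.pathname,
    visibleMs: Math.round(performance.now()),
  });
  navigator.sendBeacon('/collect', payload);
});
```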
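
For question 2, a minimal field-measurement (RUM) sketch for two of the three vitals. LCP and CLS can be observed directly with PerformanceObserver; INP requires the more involved Event Timing API, and production code commonly uses Google's web-vitals library instead:

```javascript
// Largest Contentful Paint: the latest candidate entry is the current LCP value.
new PerformanceObserver(list => {
  const entries = list.getEntries();
  console.log('LCP (ms):', entries[entries.length - 1].startTime);
}).observe({ type: 'largest-contentful-paint', buffered: true });

// Cumulative Layout Shift: sum layout-shift values not caused by user input.
let cls = 0;
new PerformanceObserver(list => {
  for (const shift of list.getEntries()) {
    if (!shift.hadRecentInput) cls += shift.value;
  }
  console.log('CLS so far:', cls);
}).observe({ type: 'layout-shift', buffered: true });
```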

Cluster 5: Data Integrity — Observability, Bots & Quality (Sections 14–16)

  1. Explain the three pillars of observability (Logs, Metrics, Traces) and what question each answers. The overview says "analytics is observability for user behavior" — explain the parallel. How might a slow page load involve both analytics (user experience) and observability (system behavior)?
  2. What is OpenTelemetry, and why does the overview compare it to "what HTTP is to the web"? What problem does vendor-neutral instrumentation solve? Explain the overview's advice to "prefer protocols over platforms."
  3. The overview lists nine types of bots. Categorize them as good, bad, or gray-area, and explain the analytics impact of each category. Why is bot traffic a "fundamental data quality problem"?
  4. Describe five bot detection methods from the overview. Why is User-Agent filtering alone insufficient? Explain the overview's recommended first-pass filter: comparing server logs to client-side beacon data. (A sketch follows this list.)
  5. The overview identifies nine data quality threats (bots, ad blockers, cookie clearing, caching, CDN, device switching, VPN, JS disabled, incognito). Pick four and explain the specific analytics impact and mitigation. Why does the overview say "every method has blind spots" and recommend combining methods?
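
For question 4, a sketch of the log-versus-beacon comparison as a first-pass filter. The session extraction is assumed to have happened elsewhere and the names are illustrative: sessions that appear in the server log but never sent a client-side beacon get flagged for review (they may also be JavaScript-disabled visitors, so treat the result as a signal, not proof).

```javascript
// First-pass bot screen: sessions present in the access log but absent from
// beacon data are suspicious. `logSessions` and `beaconSessions` are
// hypothetical arrays of session IDs already extracted from the server log
// and the beacon endpoint.
function likelyBotSessions(logSessions, beaconSessions) {
  const sawBeacon = new Set(beaconSessions);
  return [...new Set(logSessions)].filter(id => !sawBeacon.has(id));
}
```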