The schema is built. Now write the SQL queries that power the reporting API and dashboard. These queries aggregate raw data into the metrics that matter — top pages, pageviews over time, Web Vitals percentiles, error rates, and session statistics.
Analytics queries are mostly aggregations: COUNT, AVG, SUM, GROUP BY, ORDER BY. The pattern is consistent across nearly every query in this module — select some dimension (date, page, browser), aggregate a metric, and sort the results.
Always filter by date range using WHERE server_timestamp BETWEEN @start_date AND @end_date. Without a date range, queries scan the entire table, which becomes progressively slower as data accumulates. Date filtering also ensures the API and dashboard can request specific reporting periods (last 7 days, last 30 days, custom range).
Use LIMIT for pagination and to cap result sets. A "top pages" query that returns thousands of rows is not useful for a dashboard card. Capping at 10 or 20 rows keeps responses fast and the UI clean.
Use server_timestamp for all date filtering and aggregation.
All 18 queries in queries.sql use MySQL user variables (@start_date and @end_date) so you can set the date range once and run any query without modification:
SET @start_date = '2025-01-01';
SET @end_date = '2025-01-31';
-- Now run any query from queries.sql
Pageviews are the most fundamental metric in web analytics. These queries answer: how much traffic is the site getting, which pages are popular, and when do visitors arrive?
The simplest aggregation — count all pageview events grouped by day. This powers the main traffic chart on the dashboard.
SELECT DATE(server_timestamp) AS day, COUNT(*) AS pageviews
FROM pageviews
WHERE server_timestamp BETWEEN @start_date AND @end_date
AND type = 'pageview'
GROUP BY day
ORDER BY day;
The DATE() function strips the time portion, collapsing all events from the same calendar day into one row. The type = 'pageview' filter excludes other event types (like performance or error beacons) that may share the same table.
SELECT url, COUNT(*) AS views
FROM pageviews
WHERE server_timestamp BETWEEN @start_date AND @end_date
AND type = 'pageview'
GROUP BY url
ORDER BY views DESC
LIMIT 20;
This ranks pages by popularity. The LIMIT 20 keeps the result manageable for dashboard display. Note that URLs are grouped exactly as stored — if your collector normalizes URLs (stripping query strings, trailing slashes), the grouping will be cleaner.
SELECT HOUR(server_timestamp) AS hour_of_day, COUNT(*) AS pageviews
FROM pageviews
WHERE server_timestamp BETWEEN @start_date AND @end_date
AND type = 'pageview'
GROUP BY hour_of_day
ORDER BY hour_of_day;
This reveals traffic patterns: when are users most active? The result has at most 24 rows (hours 0 through 23). Useful for deciding when to schedule maintenance or deploy updates.
SELECT DATE(server_timestamp) AS day, COUNT(DISTINCT session_id) AS unique_sessions
FROM pageviews
WHERE server_timestamp BETWEEN @start_date AND @end_date
GROUP BY day
ORDER BY day;
COUNT(DISTINCT session_id) deduplicates — a user who views 10 pages in one session is counted once. This gives a cleaner picture of how many visits occurred, not just how many pages were loaded.
Referrer analysis tells you where your traffic comes from. Did users arrive from a search engine, social media, another website, or by typing the URL directly?
SELECT
CASE
WHEN referrer IS NULL OR referrer = '' THEN '(direct)'
ELSE SUBSTRING_INDEX(SUBSTRING_INDEX(referrer, '/', 3), '://', -1)
END AS referrer_domain,
COUNT(*) AS visits
FROM sessions
WHERE start_time BETWEEN @start_date AND @end_date
GROUP BY referrer_domain
ORDER BY visits DESC
LIMIT 20;
The nested SUBSTRING_INDEX calls extract just the domain from a full URL. For example, https://www.google.com/search?q=test becomes www.google.com. Grouping by domain avoids splitting the same source across hundreds of unique URLs.
We query the sessions table rather than pageviews because the referrer on the session row reflects the external traffic source (first-touch attribution), not internal navigation between pages.
SELECT
CASE WHEN referrer IS NULL OR referrer = '' THEN 'Direct' ELSE 'Referred' END AS source,
COUNT(*) AS sessions_count
FROM sessions
WHERE start_time BETWEEN @start_date AND @end_date
GROUP BY source;
A simple two-row result: how many sessions came from an external link vs. how many were direct (typed URL, bookmark, or missing referrer). This is one of the most basic traffic breakdowns and is useful as a dashboard card.
To cut noise from one-off referrers, you can extend the top referrers query with a HAVING visits >= 5 clause. This is particularly useful when your site gets linked from many small sources.
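Applied to the top referrers query above, the filter slots in between GROUP BY and ORDER BY (same sessions table and date variables):

```sql
-- Top referrer domains, ignoring sources with fewer than 5 visits
SELECT
    CASE
        WHEN referrer IS NULL OR referrer = '' THEN '(direct)'
        ELSE SUBSTRING_INDEX(SUBSTRING_INDEX(referrer, '/', 3), '://', -1)
    END AS referrer_domain,
    COUNT(*) AS visits
FROM sessions
WHERE start_time BETWEEN @start_date AND @end_date
GROUP BY referrer_domain
HAVING visits >= 5   -- HAVING filters after aggregation, so it can see the count
ORDER BY visits DESC
LIMIT 20;
```

The threshold of 5 is arbitrary — tune it to your traffic volume.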
Performance data comes from the performance table, which stores load timing and Web Vitals metrics collected by the client-side beacon. These queries surface slow pages and overall site speed.
SELECT url,
ROUND(AVG(load_time)) AS avg_load_ms,
ROUND(AVG(ttfb)) AS avg_ttfb_ms,
COUNT(*) AS samples
FROM performance
WHERE server_timestamp BETWEEN @start_date AND @end_date
GROUP BY url
ORDER BY avg_load_ms DESC
LIMIT 20;
This ranks pages from slowest to fastest. The samples column is important context — an average based on 3 samples is much less reliable than one based on 300. When building the dashboard, you may want to filter out pages with too few samples.
SELECT
ROUND(PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY lcp), 2) AS lcp_p50,
ROUND(PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY lcp), 2) AS lcp_p75,
ROUND(PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY lcp), 2) AS lcp_p95,
ROUND(PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY cls), 4) AS cls_p50,
ROUND(PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY cls), 4) AS cls_p75,
ROUND(PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY inp), 2) AS inp_p50,
ROUND(PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY inp), 2) AS inp_p75
FROM performance
WHERE server_timestamp BETWEEN @start_date AND @end_date
AND lcp IS NOT NULL;
Percentiles are far more useful than averages for performance data. The p75 (75th percentile) is what Google uses for Core Web Vitals assessment — it represents the experience of the slowest quarter of your users. The p50 is the median, and p95 captures the worst-case experience.
PERCENTILE_CONT is an ordered-set aggregate from the SQL standard. PostgreSQL supports the syntax shown above; MariaDB 10.3+ implements it as a window function, which requires adding an OVER () clause; MySQL does not implement it at all. On MySQL you can approximate percentiles with a nearest-rank calculation — ORDER BY with LIMIT and OFFSET, or window functions in 8.0+ — though the syntax is more verbose.
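As a sketch of the nearest-rank workaround on MySQL 8.0+ (same performance table and date variables; note this picks the nearest actual sample rather than interpolating the way PERCENTILE_CONT does):

```sql
-- Approximate p75 LCP without PERCENTILE_CONT:
-- rank every sample, then pick the row at the 75% position.
SELECT lcp AS lcp_p75
FROM (
    SELECT lcp,
           ROW_NUMBER() OVER (ORDER BY lcp) AS rn,  -- 1..N in ascending order
           COUNT(*) OVER () AS total                -- total sample count
    FROM performance
    WHERE server_timestamp BETWEEN @start_date AND @end_date
      AND lcp IS NOT NULL
) ranked
WHERE rn = CEIL(total * 0.75);
```

Swap 0.75 for 0.50 or 0.95 to get the other percentiles.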
SELECT url,
ROUND(AVG(load_time)) AS avg_load_ms,
MAX(load_time) AS max_load_ms,
COUNT(*) AS samples
FROM performance
WHERE server_timestamp BETWEEN @start_date AND @end_date
GROUP BY url
HAVING samples >= 5
ORDER BY avg_load_ms DESC
LIMIT 10;
The HAVING samples >= 5 clause filters out pages with too few data points. A page visited once with a 10-second load time is an outlier, not a pattern. We want to surface pages that are consistently slow.
Time to First Byte (TTFB) is the delay between the browser sending the request and receiving the first byte of the response. High TTFB points to server-side issues (slow database queries, cold caches, overloaded servers) rather than client-side problems. You can analyze TTFB distribution using the same percentile technique applied to the ttfb column.
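For instance, where PERCENTILE_CONT is available, the Web Vitals query adapts directly to the ttfb column:

```sql
-- TTFB distribution using the same percentile technique
SELECT
    ROUND(PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY ttfb), 2) AS ttfb_p50,
    ROUND(PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY ttfb), 2) AS ttfb_p75,
    ROUND(PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY ttfb), 2) AS ttfb_p95
FROM performance
WHERE server_timestamp BETWEEN @start_date AND @end_date
  AND ttfb IS NOT NULL;
```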
JavaScript errors captured by the collector are stored in the errors table. These queries help identify the most common errors, track error trends, and pinpoint which pages are producing the most problems.
SELECT error_message, COUNT(*) AS occurrences,
MIN(server_timestamp) AS first_seen,
MAX(server_timestamp) AS last_seen
FROM errors
WHERE server_timestamp BETWEEN @start_date AND @end_date
GROUP BY error_message
ORDER BY occurrences DESC
LIMIT 20;
Grouping by error_message collapses duplicate errors into a single row with a count. The first_seen and last_seen columns tell you whether an error is new or has been persisting for a while.
SELECT DATE(server_timestamp) AS day, COUNT(*) AS error_count
FROM errors
WHERE server_timestamp BETWEEN @start_date AND @end_date
GROUP BY day
ORDER BY day;
This mirrors the pageview-by-date query but for errors. A sudden spike in errors often correlates with a deployment or an external service failure. Plotting this alongside pageviews helps distinguish "more errors because more traffic" from "more errors because something broke."
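One way to make that comparison in a single result set is to join the two daily counts and compute a rate. This is a sketch that reuses the daily pageview and error queries as subqueries; the errors-per-1,000-views normalization is an assumption, not something fixed by the schema:

```sql
-- Daily error rate: errors per 1,000 pageviews
SELECT p.day,
       p.pageviews,
       COALESCE(e.error_count, 0) AS errors,  -- days with no errors show 0
       ROUND(COALESCE(e.error_count, 0) / p.pageviews * 1000, 2) AS errors_per_1k_views
FROM (
    SELECT DATE(server_timestamp) AS day, COUNT(*) AS pageviews
    FROM pageviews
    WHERE server_timestamp BETWEEN @start_date AND @end_date
      AND type = 'pageview'
    GROUP BY day
) p
LEFT JOIN (
    SELECT DATE(server_timestamp) AS day, COUNT(*) AS error_count
    FROM errors
    WHERE server_timestamp BETWEEN @start_date AND @end_date
    GROUP BY day
) e ON e.day = p.day
ORDER BY p.day;
```

A flat errors_per_1k_views with rising raw errors means traffic growth; a rising rate means something broke.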
SELECT url, COUNT(*) AS error_count
FROM errors
WHERE server_timestamp BETWEEN @start_date AND @end_date
GROUP BY url
ORDER BY error_count DESC
LIMIT 10;
Not all pages produce errors equally. This query surfaces the worst offenders so you can focus debugging effort where it matters most.
To track how many distinct error messages appear each day (rather than total occurrences), use COUNT(DISTINCT error_message). A rising count of unique errors suggests new bugs are being introduced, while a stable count with rising occurrences suggests existing bugs are affecting more users.
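Putting both counts side by side makes that comparison easy (same errors table and date variables):

```sql
-- Distinct error messages per day vs. total occurrences
SELECT DATE(server_timestamp) AS day,
       COUNT(DISTINCT error_message) AS unique_errors,
       COUNT(*) AS total_occurrences
FROM errors
WHERE server_timestamp BETWEEN @start_date AND @end_date
GROUP BY day
ORDER BY day;
```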
Session-level queries provide engagement metrics: how many visits occur, how long users stay, how many pages they view, and whether they bounce after a single page.
SELECT DATE(start_time) AS day, COUNT(*) AS session_count
FROM sessions
WHERE start_time BETWEEN @start_date AND @end_date
GROUP BY day
ORDER BY day;
SELECT
ROUND(AVG(duration_seconds)) AS avg_duration_sec,
ROUND(AVG(page_count), 1) AS avg_pages
FROM sessions
WHERE start_time BETWEEN @start_date AND @end_date;
These two numbers are core engagement metrics. Longer sessions and more pages per session generally indicate higher engagement. However, context matters — a documentation site where users find their answer on the first page is not "failing" because of short sessions.
SELECT
COUNT(*) AS total_sessions,
SUM(CASE WHEN page_count = 1 THEN 1 ELSE 0 END) AS bounced,
ROUND(SUM(CASE WHEN page_count = 1 THEN 1 ELSE 0 END) / COUNT(*) * 100, 1) AS bounce_rate_pct
FROM sessions
WHERE start_time BETWEEN @start_date AND @end_date;
A "bounce" is a session where the user viewed only one page and left. The bounce rate is the percentage of all sessions that were bounces. The CASE WHEN page_count = 1 expression counts single-page sessions. Dividing by the total and multiplying by 100 gives the percentage.
Combining session data with referrer analysis reveals which traffic sources produce the most engaged users. You can join the sessions-per-day query with the referrer domain extraction to see average session duration by source — do users from search engines stay longer than users from social media?
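A sketch of that combination, reusing the domain-extraction CASE on the sessions table; the minimum-sessions threshold is an assumption to keep tiny sources from dominating the averages:

```sql
-- Engagement by traffic source: duration and depth per referrer domain
SELECT
    CASE
        WHEN referrer IS NULL OR referrer = '' THEN '(direct)'
        ELSE SUBSTRING_INDEX(SUBSTRING_INDEX(referrer, '/', 3), '://', -1)
    END AS referrer_domain,
    COUNT(*) AS sessions_count,
    ROUND(AVG(duration_seconds)) AS avg_duration_sec,
    ROUND(AVG(page_count), 1) AS avg_pages
FROM sessions
WHERE start_time BETWEEN @start_date AND @end_date
GROUP BY referrer_domain
HAVING sessions_count >= 5   -- skip sources with too few sessions to average
ORDER BY avg_duration_sec DESC
LIMIT 20;
```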
The dashboard overview typically shows four key metrics at a glance: total pageviews, total sessions, average load time, and total errors. Rather than running four separate queries, a single query with subqueries returns all values in one row:
SELECT
(SELECT COUNT(*) FROM pageviews
WHERE server_timestamp BETWEEN @start_date AND @end_date
AND type = 'pageview') AS total_pageviews,
(SELECT COUNT(DISTINCT session_id) FROM pageviews
WHERE server_timestamp BETWEEN @start_date AND @end_date) AS total_sessions,
(SELECT ROUND(AVG(load_time)) FROM performance
WHERE server_timestamp BETWEEN @start_date AND @end_date) AS avg_load_time_ms,
(SELECT COUNT(*) FROM errors
WHERE server_timestamp BETWEEN @start_date AND @end_date) AS total_errors;
Each subquery hits a different table but they all share the same date range. The result is a single row with four columns, which maps directly to four dashboard cards.
This query is the entry point for the reporting API. When the dashboard loads, it makes one request that calls this query, and the four returned values populate the summary cards at the top of the page.
Understanding your audience's browsers and devices helps you make informed decisions about which features to use, what to polyfill, and whether your responsive design is working.
SELECT
CASE
WHEN user_agent LIKE '%Firefox%' THEN 'Firefox'
WHEN user_agent LIKE '%Edg/%' THEN 'Edge'
WHEN user_agent LIKE '%Chrome%' THEN 'Chrome'
WHEN user_agent LIKE '%Safari%' THEN 'Safari'
ELSE 'Other'
END AS browser,
COUNT(*) AS pageviews
FROM pageviews
WHERE server_timestamp BETWEEN @start_date AND @end_date
AND type = 'pageview'
GROUP BY browser
ORDER BY pageviews DESC;
The CASE WHEN LIKE pattern is a simplified approach to user-agent parsing. The order matters — Edge's user-agent string contains "Chrome", so we must check for Edg/ before Chrome. Similarly, Chrome's user-agent contains "Safari", so the Chrome check must come before Safari.
SELECT
CASE
WHEN viewport_width < 768 THEN 'Mobile (<768px)'
WHEN viewport_width < 1024 THEN 'Tablet (768-1023px)'
WHEN viewport_width < 1440 THEN 'Desktop (1024-1439px)'
ELSE 'Large (1440px+)'
END AS device_class,
COUNT(*) AS pageviews
FROM pageviews
WHERE server_timestamp BETWEEN @start_date AND @end_date
AND type = 'pageview'
AND viewport_width IS NOT NULL
GROUP BY device_class
ORDER BY pageviews DESC;
Viewport width is a more reliable device signal than user-agent parsing. The breakpoints (768px, 1024px, 1440px) correspond to common CSS media query thresholds. This tells you what percentage of your users are on mobile, tablet, or desktop — critical information for prioritizing responsive design work.
The viewport_width IS NOT NULL filter excludes beacons where the collector did not capture viewport data (such as server-side log entries or malformed beacons).