Module 01: Schema Design

This module walks through the complete MySQL schema for the analytics pipeline — all 6 tables, their columns, indexes, and relationships. By the end you will have a working database ready for ingestion and reporting.

Demo files: schema.sql (the complete table definitions) and seed.sql (test data), both loaded in section 8.

1. Creating the Database

Before creating any tables we need a database. The analytics database uses utf8mb4 as the default character set, which supports the full Unicode range including emoji and CJK characters. The utf8mb4_unicode_ci collation provides case-insensitive comparisons with proper Unicode sorting.

CREATE DATABASE IF NOT EXISTS analytics
  DEFAULT CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci;

USE analytics;
Why utf8mb4? MySQL's older utf8 character set (an alias for the 3-byte utf8mb3) stores at most 3 bytes per character, which excludes many Unicode code points, including emoji. The utf8mb4 encoding uses up to 4 bytes per character, covering the entire Unicode standard. Always use utf8mb4 for new databases.

Every table that follows uses the InnoDB storage engine. InnoDB provides row-level locking, foreign key support, and crash-safe transactions — all essential for a system that ingests concurrent writes from a live collector endpoint.

2. The pageviews Table

The pageviews table is the primary fact table in the schema. Every beacon the collector receives becomes a row in this table. Let us walk through each column:

Column | Type | Purpose
id | INT AUTO_INCREMENT | Surrogate primary key. Every row gets a unique integer.
url | VARCHAR(2048) | The page URL. 2048 characters accommodates long query strings.
type | ENUM('pageview','event','error','performance') | Beacon type. ENUM enforces valid values at the database level.
user_agent | VARCHAR(512) | Browser user-agent string. 512 chars covers even bloated UA strings.
viewport_width | SMALLINT UNSIGNED | Browser viewport width in pixels. Max 65,535 — more than enough.
viewport_height | SMALLINT UNSIGNED | Browser viewport height in pixels.
referrer | VARCHAR(2048) | Referring URL. Same length as url since referrers can be long.
client_timestamp | DATETIME | When the event occurred on the client. May be inaccurate (clock skew).
server_timestamp | DATETIME NOT NULL | When the server received the beacon. This is the canonical timestamp.
client_ip | VARCHAR(45) | Client IP. 45 chars fits both IPv4 and full IPv6 addresses.
session_id | VARCHAR(64) | Client-generated session identifier for sessionization.
payload | JSON | Flexible field for any extra data the beacon sends.
created_at | TIMESTAMP DEFAULT CURRENT_TIMESTAMP | Row insertion timestamp (database server time).

Why VARCHAR(2048) for URLs? While most URLs are under 200 characters, query strings can get very long. Google search result URLs, UTM-tagged marketing links, and single-page apps with encoded state can easily reach 1000+ characters. The HTTP spec does not define a maximum URL length, but 2048 characters is a widely used practical ceiling (Internet Explorer historically capped URLs at 2,083 characters, and many servers and CDNs still assume a similar limit). Using 2048 means we essentially never truncate a real-world URL.

Why ENUM for type? An ENUM column stores values as integers internally (1 byte for up to 255 values), making it compact and fast to index. More importantly, it prevents invalid values from entering the database. If a beacon sends type: "foo", the INSERT fails under MySQL 8.0's default strict SQL mode rather than silently storing garbage.
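As a quick illustration (hypothetical values; this assumes the default strict SQL mode is active):

-- 'foo' is not an allowed ENUM value, so strict SQL mode rejects the row
INSERT INTO pageviews (url, type, server_timestamp)
VALUES ('/pricing', 'foo', NOW());
-- fails with a data-truncation error on column 'type'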

Why JSON for payload? The payload column holds arbitrary key-value pairs that vary by beacon type. A pageview might include scroll depth data; an event might include a clicked element's selector. MySQL 8.0's native JSON type validates the JSON on write and supports indexed virtual columns if you need to query specific keys later.
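For example, if scroll depth turned out to be worth querying directly, a generated-column sketch might look like this (scroll_depth is an assumed payload key used purely for illustration, not part of the documented schema):

-- Expose payload->>'$.scroll_depth' as a virtual column, then index it
ALTER TABLE pageviews
  ADD COLUMN scroll_depth INT UNSIGNED
    GENERATED ALWAYS AS (CAST(payload->>'$.scroll_depth' AS UNSIGNED)) VIRTUAL;

ALTER TABLE pageviews
  ADD INDEX idx_pv_scroll_depth (scroll_depth);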

Indexes

INDEX idx_pv_url (url(255)),
INDEX idx_pv_type (type),
INDEX idx_pv_server_timestamp (server_timestamp),
INDEX idx_pv_session_id (session_id)

The url(255) is a prefix index — MySQL indexes only the first 255 characters of the URL. This is a deliberate tradeoff: the index is smaller and faster, and most URL filtering queries match on the path portion (well under 255 characters). InnoDB limits index key length to 3072 bytes; indexing a full VARCHAR(2048) in utf8mb4 (4 bytes per character) would require 8192 bytes, which exceeds that limit.

Prefix indexes and equality: A prefix index on url(255) works well for prefix-match queries such as WHERE url LIKE '/about%', but it cannot enforce uniqueness on the full column, and an exact-match lookup still requires MySQL to read the row and compare the complete value. For the filtering this analytics schema needs, prefix indexes are a good fit.
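Putting the columns and indexes together, the pageviews definition looks roughly like the following sketch (the authoritative statement lives in schema.sql; columns not marked NOT NULL above are assumed nullable):

CREATE TABLE IF NOT EXISTS pageviews (
  id                INT AUTO_INCREMENT PRIMARY KEY,
  url               VARCHAR(2048),
  type              ENUM('pageview','event','error','performance'),
  user_agent        VARCHAR(512),
  viewport_width    SMALLINT UNSIGNED,
  viewport_height   SMALLINT UNSIGNED,
  referrer          VARCHAR(2048),
  client_timestamp  DATETIME,
  server_timestamp  DATETIME NOT NULL,
  client_ip         VARCHAR(45),
  session_id        VARCHAR(64),
  payload           JSON,
  created_at        TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  INDEX idx_pv_url (url(255)),
  INDEX idx_pv_type (type),
  INDEX idx_pv_server_timestamp (server_timestamp),
  INDEX idx_pv_session_id (session_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;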

3. The events Table

The events table stores custom events: button clicks, scroll milestones, form submissions, video plays, or any interaction you want to track beyond simple pageviews.

Column | Type | Purpose
id | INT AUTO_INCREMENT | Primary key
session_id | VARCHAR(64) | Links the event to a session
event_name | VARCHAR(128) NOT NULL | What happened: "click", "scroll_50", "form_submit"
event_category | VARCHAR(128) | Grouping label: "navigation", "engagement", "conversion"
event_data | JSON | Arbitrary metadata: {"button": "signup", "position": "header"}
url | VARCHAR(2048) | Page where the event occurred
server_timestamp | DATETIME NOT NULL | When the server received the event
created_at | TIMESTAMP DEFAULT CURRENT_TIMESTAMP | Row creation time

The combination of event_name and event_category follows the same pattern used by Google Analytics. The event_name describes the specific action, while event_category groups related actions together for reporting.
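A typical report built on that pattern rolls events up by category and then by name (an illustrative query, not part of the demo files):

SELECT event_category, event_name, COUNT(*) AS occurrences
FROM events
GROUP BY event_category, event_name
ORDER BY event_category, occurrences DESC;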

INDEX idx_ev_event_name (event_name),
INDEX idx_ev_session_id (session_id),
INDEX idx_ev_server_timestamp (server_timestamp)

The index on event_name accelerates queries like SELECT COUNT(*) FROM events WHERE event_name = 'click'. The server_timestamp index supports time-range filtering, which is the most common query pattern in analytics dashboards (e.g., "show me events from the last 7 days").
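As with pageviews, here is a sketch of the full definition implied by the columns and indexes above (schema.sql remains the authoritative source):

CREATE TABLE IF NOT EXISTS events (
  id                INT AUTO_INCREMENT PRIMARY KEY,
  session_id        VARCHAR(64),
  event_name        VARCHAR(128) NOT NULL,
  event_category    VARCHAR(128),
  event_data        JSON,
  url               VARCHAR(2048),
  server_timestamp  DATETIME NOT NULL,
  created_at        TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  INDEX idx_ev_event_name (event_name),
  INDEX idx_ev_session_id (session_id),
  INDEX idx_ev_server_timestamp (server_timestamp)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;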

4. The errors Table

The errors table captures JavaScript errors reported by the collector's error-tracking module. Each row represents one uncaught exception or window.onerror event.

Column | Type | Purpose
id | INT AUTO_INCREMENT | Primary key
session_id | VARCHAR(64) | Links the error to a session
error_message | VARCHAR(1024) NOT NULL | The error message string (e.g., "Cannot read properties of undefined")
error_source | VARCHAR(2048) | Script URL that generated the error
error_line | INT UNSIGNED | Line number in the source file
error_column | INT UNSIGNED | Column number in the source file
stack_trace | TEXT | Full stack trace string
url | VARCHAR(2048) | Page URL where the error occurred
user_agent | VARCHAR(512) | Browser user-agent (useful for browser-specific bugs)
server_timestamp | DATETIME NOT NULL | When the server received the error report
created_at | TIMESTAMP DEFAULT CURRENT_TIMESTAMP | Row creation time

Why TEXT for stack_trace instead of VARCHAR? Stack traces can be extremely long, especially with deeply nested call stacks or minified code where the entire bundle appears in the trace. A TEXT column holds up to 65,535 bytes, and unlike VARCHAR it contributes only 9 to 12 bytes toward MySQL's 65,535-byte row-size limit, which all of a table's VARCHAR columns share. Since this table already has several large VARCHAR columns (url, error_source, user_agent), using TEXT for stack_trace keeps us well clear of that limit.

Long TEXT values are stored off-page in InnoDB (with the default DYNAMIC row format, only a 20-byte pointer stays in the row), which keeps the table compact for queries that do not need the full stack trace.

INDEX idx_err_message (error_message(255)),
INDEX idx_err_session_id (session_id),
INDEX idx_err_server_timestamp (server_timestamp)

The prefix index error_message(255) matters here too. Error messages often start with the same pattern ("TypeError: Cannot read..."), but the first 255 characters are almost always enough to distinguish one error from another, so the index accelerates filtering and equality lookups by message. You cannot index a full VARCHAR(1024) in utf8mb4 without exceeding InnoDB's key length limit.
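For reference, a sketch of the errors definition that follows from the columns and indexes above (again, schema.sql is authoritative):

CREATE TABLE IF NOT EXISTS errors (
  id                INT AUTO_INCREMENT PRIMARY KEY,
  session_id        VARCHAR(64),
  error_message     VARCHAR(1024) NOT NULL,
  error_source      VARCHAR(2048),
  error_line        INT UNSIGNED,
  error_column      INT UNSIGNED,
  stack_trace       TEXT,
  url               VARCHAR(2048),
  user_agent        VARCHAR(512),
  server_timestamp  DATETIME NOT NULL,
  created_at        TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  INDEX idx_err_message (error_message(255)),
  INDEX idx_err_session_id (session_id),
  INDEX idx_err_server_timestamp (server_timestamp)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;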

5. The performance Table

The performance table stores both Navigation Timing API metrics and Core Web Vitals. This data comes from the collector's performance-timing and web-vitals modules.

Column | Type | Purpose
id | INT AUTO_INCREMENT | Primary key
session_id | VARCHAR(64) | Links to the session
url | VARCHAR(2048) NOT NULL | Page that was measured
ttfb | INT UNSIGNED | Time to First Byte in milliseconds
dom_content_loaded | INT UNSIGNED | DOMContentLoaded event time in ms
dom_complete | INT UNSIGNED | domComplete time in ms
load_time | INT UNSIGNED | loadEventEnd time in ms
lcp | DECIMAL(10,2) | Largest Contentful Paint in ms
cls | DECIMAL(8,4) | Cumulative Layout Shift (unitless score)
inp | DECIMAL(10,2) | Interaction to Next Paint in ms
fcp | DECIMAL(10,2) | First Contentful Paint in ms
transfer_size | INT UNSIGNED | Total transfer size in bytes
resource_count | SMALLINT UNSIGNED | Number of resources loaded
server_timestamp | DATETIME NOT NULL | When the server received the data
created_at | TIMESTAMP DEFAULT CURRENT_TIMESTAMP | Row creation time

Why DECIMAL for Web Vitals? Navigation Timing values (ttfb, dom_content_loaded, dom_complete, load_time) are whole-millisecond integers, so INT UNSIGNED is the right type. Web Vitals, however, are reported with fractional precision: the lcp, fcp, and inp columns hold fractional millisecond values, and cls is a small unitless score, typically well below 1.

Using DECIMAL instead of FLOAT avoids floating-point rounding errors. When you compute percentiles (p75, p95) for performance dashboards, even small rounding errors can accumulate across thousands of rows. DECIMAL stores exact values.
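As a sketch of what a percentile query can look like in MySQL 8.0 (one way to approximate p75; it assumes each URL has enough samples for the percentile to be meaningful):

-- Approximate 75th-percentile LCP per page using a window function
SELECT url, MIN(lcp) AS p75_lcp
FROM (
  SELECT url, lcp,
         PERCENT_RANK() OVER (PARTITION BY url ORDER BY lcp) AS pr
  FROM performance
  WHERE lcp IS NOT NULL
) AS ranked
WHERE pr >= 0.75
GROUP BY url;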

Core Web Vitals thresholds (for reference when building dashboards): LCP is good at or below 2.5 seconds and poor above 4.0 seconds; INP is good at or below 200 ms and poor above 500 ms; CLS is good at or below 0.1 and poor above 0.25. Values between the good and poor cutoffs fall into the "needs improvement" band.
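In dashboard SQL those thresholds become a simple bucketing expression, for example (illustrative only):

SELECT
  CASE
    WHEN lcp <= 2500 THEN 'good'
    WHEN lcp <= 4000 THEN 'needs improvement'
    ELSE 'poor'
  END AS lcp_rating,
  COUNT(*) AS samples
FROM performance
WHERE lcp IS NOT NULL
GROUP BY lcp_rating;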

6. The sessions Table

The sessions table holds aggregated visit data. Each row represents one user session as determined by the sessionization module. This is a derived table — it is populated by server-side logic that groups pageviews using a 30-minute inactivity timeout.

Column | Type | Purpose
id | INT AUTO_INCREMENT | Primary key
session_id | VARCHAR(64) NOT NULL | Client-generated session identifier
first_page | VARCHAR(2048) NOT NULL | Landing page (entry point of the visit)
last_page | VARCHAR(2048) NOT NULL | Most recent page viewed
page_count | INT UNSIGNED NOT NULL DEFAULT 1 | Total pages in the session
start_time | DATETIME NOT NULL | When the session began
last_activity | DATETIME NOT NULL | Timestamp of the most recent pageview
duration_seconds | INT UNSIGNED NOT NULL DEFAULT 0 | Elapsed time from first to last pageview
referrer | VARCHAR(2048) | First-touch referrer (traffic source)
user_agent | VARCHAR(512) | Browser user-agent string
created_at | TIMESTAMP DEFAULT CURRENT_TIMESTAMP | Row creation time

The sessions table is updated in place as new pageviews arrive. When a pageview comes in, the sessionization logic checks whether an active session exists for that session_id. If the gap since last_activity is under 30 minutes, the existing row is updated (incrementing page_count, updating last_page and last_activity, recalculating duration_seconds). If the gap exceeds 30 minutes, a new row is inserted.
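A sketch of that two-step logic expressed as SQL (the real implementation runs in the server-side ingestion code; @sid, @page, and @now stand in for the incoming pageview's session_id, URL, and server timestamp):

-- Step 1: try to extend an active session (last activity within the past 30 minutes)
UPDATE sessions
SET page_count       = page_count + 1,
    last_page        = @page,
    last_activity    = @now,
    duration_seconds = TIMESTAMPDIFF(SECOND, start_time, @now)
WHERE session_id = @sid
  AND last_activity >= @now - INTERVAL 30 MINUTE;

-- Step 2: run only if the UPDATE matched zero rows (the application checks the
-- affected-row count), starting a fresh session for this session_id
INSERT INTO sessions (session_id, first_page, last_page, start_time, last_activity)
VALUES (@sid, @page, @page, @now, @now);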

One session_id, multiple sessions: A single session_id can produce multiple rows in this table. If a user visits at 10:00 AM, leaves, and returns at 3:00 PM, the same session_id maps to two session rows with different start_time values.

7. The users Table

The users table stores dashboard login accounts. These are not the visitors being tracked — they are the people who log in to the analytics dashboard to view reports.

Column | Type | Purpose
id | INT AUTO_INCREMENT | Primary key
email | VARCHAR(255) NOT NULL UNIQUE | Login identifier. UNIQUE constraint prevents duplicate accounts.
password_hash | VARCHAR(255) NOT NULL | Bcrypt hash of the password. Never store plain text.
display_name | VARCHAR(128) NOT NULL | Name shown in the dashboard UI
role | ENUM('owner','admin','viewer') NOT NULL DEFAULT 'viewer' | Permission level
created_at | TIMESTAMP DEFAULT CURRENT_TIMESTAMP | When the account was created
last_login | DATETIME | Most recent successful login time

Why VARCHAR(255) for password_hash? Bcrypt hashes are always 60 characters. So why 255? Because password hashing algorithms evolve. If you later migrate to Argon2id (which produces longer hashes), a 255-character column accommodates the change without an ALTER TABLE. It is a forward-compatibility buffer at negligible storage cost.

Role ENUM: The three roles (owner, admin, viewer) define a simple permission hierarchy, from full control down to read-only access.

Security note: The password_hash column stores the output of bcrypt (in PHP: password_hash(); in Node.js: bcrypt.hash()). A bcrypt hash looks like $2b$10$N9qo8uLOickgx2ZMRZoMy... — the $2b$10$ prefix indicates the bcrypt algorithm with a cost factor of 10. Never store raw passwords.
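A sketch of the users definition implied by the columns above (schema.sql has the authoritative version):

CREATE TABLE IF NOT EXISTS users (
  id             INT AUTO_INCREMENT PRIMARY KEY,
  email          VARCHAR(255) NOT NULL UNIQUE,
  password_hash  VARCHAR(255) NOT NULL,
  display_name   VARCHAR(128) NOT NULL,
  role           ENUM('owner','admin','viewer') NOT NULL DEFAULT 'viewer',
  created_at     TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  last_login     DATETIME
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;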

8. Loading Test Data

The seed.sql file contains INSERT statements that populate all 6 tables with realistic test data. To load it:

# First, create the schema
mysql -u root -p < schema.sql

# Then load seed data
mysql -u root -p analytics < seed.sql

After loading, verify that data was inserted correctly:

USE analytics;

SELECT 'pageviews' AS tbl, COUNT(*) AS `rows` FROM pageviews
UNION ALL
SELECT 'events',       COUNT(*) FROM events
UNION ALL
SELECT 'errors',       COUNT(*) FROM errors
UNION ALL
SELECT 'performance',  COUNT(*) FROM performance
UNION ALL
SELECT 'sessions',     COUNT(*) FROM sessions
UNION ALL
SELECT 'users',        COUNT(*) FROM users;

Expected output:

+-------------+------+
| tbl         | rows |
+-------------+------+
| pageviews   |   30 |
| events      |   10 |
| errors      |    8 |
| performance |   15 |
| sessions    |    5 |
| users       |    3 |
+-------------+------+

You can also spot-check individual tables:

-- Check pageview distribution by type
SELECT type, COUNT(*) AS cnt
FROM pageviews
GROUP BY type;

-- Check session durations
SELECT session_id, page_count, duration_seconds,
       first_page, last_page
FROM sessions
ORDER BY start_time;

9. Verifying with EXPLAIN

Indexes only help if the query planner actually uses them. The EXPLAIN statement shows you the execution plan MySQL will use for a query — which indexes it selects, how many rows it estimates scanning, and whether it needs a full table scan.

Let us verify that our indexes work on a common analytics query — counting pageviews by day for the last 7 days:

EXPLAIN SELECT
    DATE(server_timestamp) AS day,
    COUNT(*) AS views
FROM pageviews
WHERE server_timestamp >= NOW() - INTERVAL 7 DAY
GROUP BY day
ORDER BY day;

Expected output (simplified):

+----+-------------+-----------+-------+---------------------------+---------------------------+---------+------+------+-------------+
| id | select_type | table     | type  | possible_keys             | key                       | key_len | ref  | rows | Extra       |
+----+-------------+-----------+-------+---------------------------+---------------------------+---------+------+------+-------------+
|  1 | SIMPLE      | pageviews | range | idx_pv_server_timestamp   | idx_pv_server_timestamp   | 5       | NULL |   15 | Using where |
+----+-------------+-----------+-------+---------------------------+---------------------------+---------+------+------+-------------+

Key things to look for in the EXPLAIN output:

Column | What You Want to See
type | range, ref, or eq_ref — not ALL (full table scan)
key | The name of your index (e.g., idx_pv_server_timestamp)
rows | A number much smaller than the total row count
Extra | Using where is fine; Using filesort on large tables is a warning sign

Try EXPLAIN on a few more queries to see which ones can use an index and which cannot:

-- Session lookup (should use idx_ev_session_id)
EXPLAIN SELECT * FROM events WHERE session_id = 'sess_abc123';

-- Error grouping (a prefix index such as idx_err_message(255) cannot cover a
-- GROUP BY on the full column, so expect a scan plus a temporary table)
EXPLAIN SELECT error_message, COUNT(*) AS cnt
FROM errors
GROUP BY error_message
ORDER BY cnt DESC
LIMIT 10;

-- Performance by URL (same caveat: a GROUP BY on the full url column cannot be
-- served by a prefix index)
EXPLAIN SELECT url, AVG(lcp) AS avg_lcp, AVG(cls) AS avg_cls
FROM performance
GROUP BY url;
Small-table caveat: With only 30-100 rows of seed data, MySQL's optimizer may choose a full table scan over an index scan because it is faster for tiny tables. This is normal. The indexes will be used once the tables grow to thousands of rows. The important thing is that possible_keys lists your index — that confirms MySQL knows the index exists and could use it.
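If you want to see what actually happens at execution time, EXPLAIN ANALYZE (available from MySQL 8.0.18) runs the query and reports real row counts and timings alongside the plan:

EXPLAIN ANALYZE
SELECT
    DATE(server_timestamp) AS day,
    COUNT(*) AS views
FROM pageviews
WHERE server_timestamp >= NOW() - INTERVAL 7 DAY
GROUP BY day
ORDER BY day;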