Module 01: Schema Design

This module walks through the complete MySQL schema for the analytics pipeline — all 6 tables, their columns, indexes, and relationships. By the end you will have a working database ready for ingestion and reporting.

Demo files: schema.sql (the complete table definitions) and seed.sql (test data), both loaded in section 8.

1. Creating the Database

Before creating any tables we need a database. The analytics database uses utf8mb4 as the default character set, which supports the full Unicode range including emoji and CJK characters. The utf8mb4_unicode_ci collation provides case-insensitive comparisons with proper Unicode sorting.

CREATE DATABASE IF NOT EXISTS analytics
  DEFAULT CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci;

USE analytics;
Why utf8mb4? MySQL's older utf8 character set (an alias for the 3-byte utf8mb3) stores at most 3 bytes per character, which excludes many Unicode code points, including emoji. The utf8mb4 encoding uses up to 4 bytes per character, covering the entire Unicode standard. Always use utf8mb4 for new databases.

Every table that follows uses the InnoDB storage engine. InnoDB provides row-level locking, foreign key support, and crash-safe transactions — all essential for a system that ingests concurrent writes from a live collector endpoint.

2. The pageviews Table

The pageviews table is the primary fact table in the schema. Every beacon the collector receives becomes a row in this table. Let us walk through each column:

Column | Type | Purpose
id | INT AUTO_INCREMENT | Surrogate primary key. Every row gets a unique integer.
url | VARCHAR(2048) | The page URL. 2048 characters accommodates long query strings.
type | ENUM('pageview','event','error','performance') | Beacon type. ENUM enforces valid values at the database level.
user_agent | VARCHAR(512) | Browser user-agent string. 512 chars covers even bloated UA strings.
viewport_width | SMALLINT UNSIGNED | Browser viewport width in pixels. Max 65,535 — more than enough.
viewport_height | SMALLINT UNSIGNED | Browser viewport height in pixels.
referrer | VARCHAR(2048) | Referring URL. Same length as url since referrers can be long.
client_timestamp | DATETIME | When the event occurred on the client. May be inaccurate (clock skew).
server_timestamp | DATETIME NOT NULL | When the server received the beacon. This is the canonical timestamp.
client_ip | VARCHAR(45) | Client IP. 45 chars fits both IPv4 and full IPv6 addresses.
session_id | VARCHAR(64) | Client-generated session identifier for sessionization.
payload | JSON | Flexible field for any extra data the beacon sends.
created_at | TIMESTAMP DEFAULT CURRENT_TIMESTAMP | Row insertion timestamp (database server time).

Why VARCHAR(2048) for URLs? While most URLs are under 200 characters, query strings can get very long. Google search result URLs, UTM-tagged marketing links, and single-page apps with encoded state can easily reach 1000+ characters. The HTTP spec does not define a maximum URL length, but 2048 characters is a widely used practical ceiling (Internet Explorer historically capped URLs at 2,083 characters, and many servers and CDNs still assume a similar limit). Using 2048 means we essentially never truncate a real-world URL.

Why ENUM for type? An ENUM column stores values as integers internally (1 byte for up to 255 values), making it compact and fast to index. More importantly, it prevents invalid values from entering the database. If a beacon sends type: "foo", the INSERT fails under MySQL 8.0's default strict SQL mode rather than silently storing garbage.
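As a quick illustration (hypothetical values; this assumes the default strict SQL mode is active):

-- 'foo' is not an allowed ENUM value, so strict SQL mode rejects the row
INSERT INTO pageviews (url, type, server_timestamp)
VALUES ('/pricing', 'foo', NOW());
-- fails with a data-truncation error on column 'type'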

Why JSON for payload? The payload column holds arbitrary key-value pairs that vary by beacon type. A pageview might include scroll depth data; an event might include a clicked element's selector. MySQL 8.0's native JSON type validates the JSON on write and supports indexed virtual columns if you need to query specific keys later.
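For example, if scroll depth turned out to be worth querying directly, a generated-column sketch might look like this (scroll_depth is an assumed payload key used purely for illustration, not part of the documented schema):

-- Expose payload->>'$.scroll_depth' as a virtual column, then index it
ALTER TABLE pageviews
  ADD COLUMN scroll_depth INT UNSIGNED
    GENERATED ALWAYS AS (CAST(payload->>'$.scroll_depth' AS UNSIGNED)) VIRTUAL;

ALTER TABLE pageviews
  ADD INDEX idx_pv_scroll_depth (scroll_depth);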

Indexes

INDEX idx_pv_url (url(255)),
INDEX idx_pv_type (type),
INDEX idx_pv_server_timestamp (server_timestamp),
INDEX idx_pv_session_id (session_id)

The url(255) is a prefix index — MySQL indexes only the first 255 characters of the URL. This is a deliberate tradeoff: the index is smaller and faster, and most URL filtering queries match on the path portion (well under 255 characters). InnoDB limits index key length to 3072 bytes; indexing a full VARCHAR(2048) in utf8mb4 (4 bytes per character) would require 8192 bytes, which exceeds that limit.

Prefix indexes and equality: A prefix index on url(255) works well for prefix-match queries such as WHERE url LIKE '/about%', but it cannot enforce uniqueness on the full column, and an exact-match lookup still requires MySQL to read the row and compare the complete value. For the filtering this analytics schema needs, prefix indexes are a good fit.
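Putting the columns and indexes together, the pageviews definition looks roughly like the following sketch (the authoritative statement lives in schema.sql; columns not marked NOT NULL above are assumed nullable):

CREATE TABLE IF NOT EXISTS pageviews (
  id                INT AUTO_INCREMENT PRIMARY KEY,
  url               VARCHAR(2048),
  type              ENUM('pageview','event','error','performance'),
  user_agent        VARCHAR(512),
  viewport_width    SMALLINT UNSIGNED,
  viewport_height   SMALLINT UNSIGNED,
  referrer          VARCHAR(2048),
  client_timestamp  DATETIME,
  server_timestamp  DATETIME NOT NULL,
  client_ip         VARCHAR(45),
  session_id        VARCHAR(64),
  payload           JSON,
  created_at        TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  INDEX idx_pv_url (url(255)),
  INDEX idx_pv_type (type),
  INDEX idx_pv_server_timestamp (server_timestamp),
  INDEX idx_pv_session_id (session_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;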

3. The events Table

The events table stores custom events: button clicks, scroll milestones, form submissions, video plays, or any interaction you want to track beyond simple pageviews.

Column | Type | Purpose
id | INT AUTO_INCREMENT | Primary key
session_id | VARCHAR(64) | Links the event to a session
event_name | VARCHAR(128) NOT NULL | What happened: "click", "scroll_50", "form_submit"
event_category | VARCHAR(128) | Grouping label: "navigation", "engagement", "conversion"
event_data | JSON | Arbitrary metadata: {"button": "signup", "position": "header"}
url | VARCHAR(2048) | Page where the event occurred
server_timestamp | DATETIME NOT NULL | When the server received the event
created_at | TIMESTAMP DEFAULT CURRENT_TIMESTAMP | Row creation time

The combination of event_name and event_category follows the same pattern used by Google Analytics. The event_name describes the specific action, while event_category groups related actions together for reporting.
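A typical report built on that pattern rolls events up by category and then by name (an illustrative query, not part of the demo files):

SELECT event_category, event_name, COUNT(*) AS occurrences
FROM events
GROUP BY event_category, event_name
ORDER BY event_category, occurrences DESC;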

INDEX idx_ev_event_name (event_name),
INDEX idx_ev_session_id (session_id),
INDEX idx_ev_server_timestamp (server_timestamp)

The index on event_name accelerates queries like SELECT COUNT(*) FROM events WHERE event_name = 'click'. The server_timestamp index supports time-range filtering, which is the most common query pattern in analytics dashboards (e.g., "show me events from the last 7 days").
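As with pageviews, here is a sketch of the full definition implied by the columns and indexes above (schema.sql remains the authoritative source):

CREATE TABLE IF NOT EXISTS events (
  id                INT AUTO_INCREMENT PRIMARY KEY,
  session_id        VARCHAR(64),
  event_name        VARCHAR(128) NOT NULL,
  event_category    VARCHAR(128),
  event_data        JSON,
  url               VARCHAR(2048),
  server_timestamp  DATETIME NOT NULL,
  created_at        TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  INDEX idx_ev_event_name (event_name),
  INDEX idx_ev_session_id (session_id),
  INDEX idx_ev_server_timestamp (server_timestamp)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;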

4. The errors Table

The errors table captures JavaScript errors reported by the collector's error-tracking module. Each row represents one uncaught exception or window.onerror event.

Column | Type | Purpose
id | INT AUTO_INCREMENT | Primary key
session_id | VARCHAR(64) | Links the error to a session
error_message | VARCHAR(1024) NOT NULL | The error message string (e.g., "Cannot read properties of undefined")
error_source | VARCHAR(2048) | Script URL that generated the error
error_line | INT UNSIGNED | Line number in the source file
error_column | INT UNSIGNED | Column number in the source file
stack_trace | TEXT | Full stack trace string
url | VARCHAR(2048) | Page URL where the error occurred
user_agent | VARCHAR(512) | Browser user-agent (useful for browser-specific bugs)
server_timestamp | DATETIME NOT NULL | When the server received the error report
created_at | TIMESTAMP DEFAULT CURRENT_TIMESTAMP | Row creation time

Why TEXT for stack_trace instead of VARCHAR? Stack traces can be extremely long, especially with deeply nested call stacks or minified code where the entire bundle appears in the trace. A TEXT column holds up to 65,535 bytes, and unlike VARCHAR it contributes only 9 to 12 bytes toward MySQL's 65,535-byte row-size limit, which all of a table's VARCHAR columns share. Since this table already has several large VARCHAR columns (url, error_source, user_agent), using TEXT for stack_trace keeps us well clear of that limit.

Long TEXT values are stored off-page in InnoDB (with the default DYNAMIC row format, only a 20-byte pointer stays in the row), which keeps the table compact for queries that do not need the full stack trace.

INDEX idx_err_message (error_message(255)),
INDEX idx_err_session_id (session_id),
INDEX idx_err_server_timestamp (server_timestamp)

The prefix index error_message(255) matters here too. Error messages often start with the same pattern ("TypeError: Cannot read..."), but the first 255 characters are almost always enough to distinguish one error from another, so the index accelerates filtering and equality lookups by message. You cannot index a full VARCHAR(1024) in utf8mb4 without exceeding InnoDB's key length limit.
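For reference, a sketch of the errors definition that follows from the columns and indexes above (again, schema.sql is authoritative):

CREATE TABLE IF NOT EXISTS errors (
  id                INT AUTO_INCREMENT PRIMARY KEY,
  session_id        VARCHAR(64),
  error_message     VARCHAR(1024) NOT NULL,
  error_source      VARCHAR(2048),
  error_line        INT UNSIGNED,
  error_column      INT UNSIGNED,
  stack_trace       TEXT,
  url               VARCHAR(2048),
  user_agent        VARCHAR(512),
  server_timestamp  DATETIME NOT NULL,
  created_at        TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  INDEX idx_err_message (error_message(255)),
  INDEX idx_err_session_id (session_id),
  INDEX idx_err_server_timestamp (server_timestamp)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;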

5. The performance Table

The performance table stores both Navigation Timing API metrics and Core Web Vitals. This data comes from the collector's performance-timing and web-vitals modules.

Column | Type | Purpose
id | INT AUTO_INCREMENT | Primary key
session_id | VARCHAR(64) | Links to the session
url | VARCHAR(2048) NOT NULL | Page that was measured
ttfb | INT UNSIGNED | Time to First Byte in milliseconds
dom_content_loaded | INT UNSIGNED | DOMContentLoaded event time in ms
dom_complete | INT UNSIGNED | domComplete time in ms
load_time | INT UNSIGNED | loadEventEnd time in ms
lcp | DECIMAL(10,2) | Largest Contentful Paint in ms
cls | DECIMAL(8,4) | Cumulative Layout Shift (unitless score)
inp | DECIMAL(10,2) | Interaction to Next Paint in ms
fcp | DECIMAL(10,2) | First Contentful Paint in ms
transfer_size | INT UNSIGNED | Total transfer size in bytes
resource_count | SMALLINT UNSIGNED | Number of resources loaded
server_timestamp | DATETIME NOT NULL | When the server received the data
created_at | TIMESTAMP DEFAULT CURRENT_TIMESTAMP | Row creation time

Why DECIMAL for Web Vitals? Navigation Timing values (ttfb, dom_content_loaded, dom_complete, load_time) are whole-millisecond integers, so INT UNSIGNED is the right type. Web Vitals, however, are reported with fractional precision: the lcp, fcp, and inp columns hold fractional millisecond values, and cls is a small unitless score, typically well below 1.

Using DECIMAL instead of FLOAT avoids floating-point rounding errors. When you compute percentiles (p75, p95) for performance dashboards, even small rounding errors can accumulate across thousands of rows. DECIMAL stores exact values.
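As a sketch of what a percentile query can look like in MySQL 8.0 (one way to approximate p75; it assumes each URL has enough samples for the percentile to be meaningful):

-- Approximate 75th-percentile LCP per page using a window function
SELECT url, MIN(lcp) AS p75_lcp
FROM (
  SELECT url, lcp,
         PERCENT_RANK() OVER (PARTITION BY url ORDER BY lcp) AS pr
  FROM performance
  WHERE lcp IS NOT NULL
) AS ranked
WHERE pr >= 0.75
GROUP BY url;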

Core Web Vitals thresholds (for reference when building dashboards): LCP is good at or below 2.5 seconds and poor above 4.0 seconds; INP is good at or below 200 ms and poor above 500 ms; CLS is good at or below 0.1 and poor above 0.25. Values between the good and poor cutoffs fall into the "needs improvement" band.
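In dashboard SQL those thresholds become a simple bucketing expression, for example (illustrative only):

SELECT
  CASE
    WHEN lcp <= 2500 THEN 'good'
    WHEN lcp <= 4000 THEN 'needs improvement'
    ELSE 'poor'
  END AS lcp_rating,
  COUNT(*) AS samples
FROM performance
WHERE lcp IS NOT NULL
GROUP BY lcp_rating;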

6. The sessions Table

The sessions table holds aggregated visit data. Each row represents one user session as determined by the sessionization module. This is a derived table — it is populated by server-side logic that groups pageviews using a 30-minute inactivity timeout.

Column | Type | Purpose
id | INT AUTO_INCREMENT | Primary key
session_id | VARCHAR(64) NOT NULL | Client-generated session identifier
first_page | VARCHAR(2048) NOT NULL | Landing page (entry point of the visit)
last_page | VARCHAR(2048) NOT NULL | Most recent page viewed
page_count | INT UNSIGNED NOT NULL DEFAULT 1 | Total pages in the session
start_time | DATETIME NOT NULL | When the session began
last_activity | DATETIME NOT NULL | Timestamp of the most recent pageview
duration_seconds | INT UNSIGNED NOT NULL DEFAULT 0 | Elapsed time from first to last pageview
referrer | VARCHAR(2048) | First-touch referrer (traffic source)
user_agent | VARCHAR(512) | Browser user-agent string
created_at | TIMESTAMP DEFAULT CURRENT_TIMESTAMP | Row creation time

The sessions table is updated in place as new pageviews arrive. When a pageview comes in, the sessionization logic checks whether an active session exists for that session_id. If the gap since last_activity is under 30 minutes, the existing row is updated (incrementing page_count, updating last_page and last_activity, recalculating duration_seconds). If the gap exceeds 30 minutes, a new row is inserted.
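A sketch of that two-step logic expressed as SQL (the real implementation runs in the server-side ingestion code; @sid, @page, and @now stand in for the incoming pageview's session_id, URL, and server timestamp):

-- Step 1: try to extend an active session (last activity within the past 30 minutes)
UPDATE sessions
SET page_count       = page_count + 1,
    last_page        = @page,
    last_activity    = @now,
    duration_seconds = TIMESTAMPDIFF(SECOND, start_time, @now)
WHERE session_id = @sid
  AND last_activity >= @now - INTERVAL 30 MINUTE;

-- Step 2: run only if the UPDATE matched zero rows (the application checks the
-- affected-row count), starting a fresh session for this session_id
INSERT INTO sessions (session_id, first_page, last_page, start_time, last_activity)
VALUES (@sid, @page, @page, @now, @now);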

One session_id, multiple sessions: A single session_id can produce multiple rows in this table. If a user visits at 10:00 AM, leaves, and returns at 3:00 PM, the same session_id maps to two session rows with different start_time values.

7. The users Table

The users table stores dashboard login accounts. These are not the visitors being tracked — they are the people who log in to the analytics dashboard to view reports.

Column | Type | Purpose
id | INT AUTO_INCREMENT | Primary key
email | VARCHAR(255) NOT NULL UNIQUE | Login identifier. UNIQUE constraint prevents duplicate accounts.
password_hash | VARCHAR(255) NOT NULL | Bcrypt hash of the password. Never store plain text.
display_name | VARCHAR(128) NOT NULL | Name shown in the dashboard UI
role | ENUM('owner','admin','viewer') NOT NULL DEFAULT 'viewer' | Permission level
created_at | TIMESTAMP DEFAULT CURRENT_TIMESTAMP | When the account was created
last_login | DATETIME | Most recent successful login time

Why VARCHAR(255) for password_hash? Bcrypt hashes are always 60 characters. So why 255? Because password hashing algorithms evolve. If you later migrate to Argon2id (which produces longer hashes), a 255-character column accommodates the change without an ALTER TABLE. It is a forward-compatibility buffer at negligible storage cost.

Role ENUM: The three roles (owner, admin, viewer) define a simple permission hierarchy, from full control down to read-only access.

Security note: The password_hash column stores the output of bcrypt (in PHP: password_hash(); in Node.js: bcrypt.hash()). A bcrypt hash looks like $2b$10$N9qo8uLOickgx2ZMRZoMy... — the $2b$10$ prefix indicates the bcrypt algorithm with a cost factor of 10. Never store raw passwords.
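A sketch of the users definition implied by the columns above (schema.sql has the authoritative version):

CREATE TABLE IF NOT EXISTS users (
  id             INT AUTO_INCREMENT PRIMARY KEY,
  email          VARCHAR(255) NOT NULL UNIQUE,
  password_hash  VARCHAR(255) NOT NULL,
  display_name   VARCHAR(128) NOT NULL,
  role           ENUM('owner','admin','viewer') NOT NULL DEFAULT 'viewer',
  created_at     TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  last_login     DATETIME
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;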

8. Loading Test Data

The seed.sql file contains INSERT statements that populate all 6 tables with realistic test data. To load it:

# First, create the schema
mysql -u root -p < schema.sql

# Then load seed data
mysql -u root -p analytics < seed.sql

After loading, verify that data was inserted correctly:

USE analytics;

SELECT 'pageviews' AS tbl, COUNT(*) AS `rows` FROM pageviews
UNION ALL
SELECT 'events',       COUNT(*) FROM events
UNION ALL
SELECT 'errors',       COUNT(*) FROM errors
UNION ALL
SELECT 'performance',  COUNT(*) FROM performance
UNION ALL
SELECT 'sessions',     COUNT(*) FROM sessions
UNION ALL
SELECT 'users',        COUNT(*) FROM users;

Expected output:

+-------------+------+
| tbl         | rows |
+-------------+------+
| pageviews   |   30 |
| events      |   10 |
| errors      |    8 |
| performance |   15 |
| sessions    |    5 |
| users       |    3 |
+-------------+------+

You can also spot-check individual tables:

-- Check pageview distribution by type
SELECT type, COUNT(*) AS cnt
FROM pageviews
GROUP BY type;

-- Check session durations
SELECT session_id, page_count, duration_seconds,
       first_page, last_page
FROM sessions
ORDER BY start_time;

9. Verifying with EXPLAIN

Indexes only help if the query planner actually uses them. The EXPLAIN statement shows you the execution plan MySQL will use for a query — which indexes it selects, how many rows it estimates scanning, and whether it needs a full table scan.

Let us verify that our indexes work on a common analytics query — counting pageviews by day for the last 7 days:

EXPLAIN SELECT
    DATE(server_timestamp) AS day,
    COUNT(*) AS views
FROM pageviews
WHERE server_timestamp >= NOW() - INTERVAL 7 DAY
GROUP BY day
ORDER BY day;

Expected output (simplified):

+----+-------------+-----------+-------+---------------------------+---------------------------+---------+------+------+-------------+
| id | select_type | table     | type  | possible_keys             | key                       | key_len | ref  | rows | Extra       |
+----+-------------+-----------+-------+---------------------------+---------------------------+---------+------+------+-------------+
|  1 | SIMPLE      | pageviews | range | idx_pv_server_timestamp   | idx_pv_server_timestamp   | 5       | NULL |   15 | Using where |
+----+-------------+-----------+-------+---------------------------+---------------------------+---------+------+------+-------------+

Key things to look for in the EXPLAIN output:

Column | What You Want to See
type | range, ref, or eq_ref — not ALL (full table scan)
key | The name of your index (e.g., idx_pv_server_timestamp)
rows | A number much smaller than the total row count
Extra | Using where is fine; Using filesort on large tables is a warning sign

Try EXPLAIN on a few more queries to see which ones can use an index and which cannot:

-- Session lookup (should use idx_ev_session_id)
EXPLAIN SELECT * FROM events WHERE session_id = 'sess_abc123';

-- Error grouping (a prefix index such as idx_err_message(255) cannot cover a
-- GROUP BY on the full column, so expect a scan plus a temporary table)
EXPLAIN SELECT error_message, COUNT(*) AS cnt
FROM errors
GROUP BY error_message
ORDER BY cnt DESC
LIMIT 10;

-- Performance by URL (same caveat: a GROUP BY on the full url column cannot be
-- served by a prefix index)
EXPLAIN SELECT url, AVG(lcp) AS avg_lcp, AVG(cls) AS avg_cls
FROM performance
GROUP BY url;
Small-table caveat: With only 30-100 rows of seed data, MySQL's optimizer may choose a full table scan over an index scan because it is faster for tiny tables. This is normal. The indexes will be used once the tables grow to thousands of rows. The important thing is that possible_keys lists your index — that confirms MySQL knows the index exists and could use it.
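If you want to see what actually happens at execution time, EXPLAIN ANALYZE (available from MySQL 8.0.18) runs the query and reports real row counts and timings alongside the plan:

EXPLAIN ANALYZE
SELECT
    DATE(server_timestamp) AS day,
    COUNT(*) AS views
FROM pageviews
WHERE server_timestamp >= NOW() - INTERVAL 7 DAY
GROUP BY day
ORDER BY day;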