This module walks through the complete MySQL schema for the analytics pipeline — all 6 tables, their columns, indexes, and relationships. By the end you will have a working database ready for ingestion and reporting.
Before creating any tables we need a database. The analytics database uses utf8mb4 as the default character set, which supports the full Unicode range including emoji and CJK characters. The utf8mb4_unicode_ci collation provides case-insensitive comparisons with proper Unicode sorting.
CREATE DATABASE IF NOT EXISTS analytics
DEFAULT CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
USE analytics;
MySQL's legacy utf8 character set (an alias for utf8mb3) stores at most 3 bytes per character, which excludes many Unicode code points (including emoji). The utf8mb4 encoding uses up to 4 bytes per character, covering the entire Unicode standard. Always use utf8mb4 for new databases.
Every table that follows uses the InnoDB storage engine. InnoDB provides row-level locking, foreign key support, and crash-safe transactions — all essential for a system that ingests concurrent writes from a live collector endpoint.
The pageviews table is the primary fact table in the schema. Every beacon the collector receives becomes a row in this table. Let us walk through each column:
| Column | Type | Purpose |
|---|---|---|
| id | INT AUTO_INCREMENT | Surrogate primary key. Every row gets a unique integer. |
| url | VARCHAR(2048) | The page URL. 2048 characters accommodates long query strings. |
| type | ENUM('pageview','event','error','performance') | Beacon type. ENUM enforces valid values at the database level. |
| user_agent | VARCHAR(512) | Browser user-agent string. 512 chars covers even bloated UA strings. |
| viewport_width | SMALLINT UNSIGNED | Browser viewport width in pixels. Max 65,535 — more than enough. |
| viewport_height | SMALLINT UNSIGNED | Browser viewport height in pixels. |
| referrer | VARCHAR(2048) | Referring URL. Same length as url since referrers can be long. |
| client_timestamp | DATETIME | When the event occurred on the client. May be inaccurate (clock skew). |
| server_timestamp | DATETIME NOT NULL | When the server received the beacon. This is the canonical timestamp. |
| client_ip | VARCHAR(45) | Client IP. 45 chars fits both IPv4 and full IPv6 addresses. |
| session_id | VARCHAR(64) | Client-generated session identifier for sessionization. |
| payload | JSON | Flexible field for any extra data the beacon sends. |
| created_at | TIMESTAMP DEFAULT CURRENT_TIMESTAMP | Row insertion timestamp (database server time). |
Why VARCHAR(2048) for URLs? While most URLs are under 200 characters, query strings can get very long. Google search result URLs, UTM-tagged marketing links, and single-page apps with encoded state can easily reach 1000+ characters. The HTTP spec does not define a maximum URL length, but 2048 is a widely used practical ceiling (Internet Explorer historically capped URLs at 2,083 characters). Using 2048 means a URL is almost never truncated.
Why ENUM for type? An ENUM column stores values as integers internally (1 byte for up to 255 distinct values), making it compact and fast to index. More importantly, it prevents invalid values from entering the database. If a beacon sends type: "foo", the INSERT fails under strict SQL mode (the default since MySQL 5.7) rather than silently storing garbage.
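A quick way to see that guard in action; the literal values below are made up for illustration:

-- Invalid ENUM value: rejected with a data-truncation error under strict SQL mode
INSERT INTO pageviews (url, type, server_timestamp)
VALUES ('/pricing', 'foo', NOW());

-- The same INSERT with type = 'pageview' succeeds
INSERT INTO pageviews (url, type, server_timestamp)
VALUES ('/pricing', 'pageview', NOW());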
Why JSON for payload? The payload column holds arbitrary key-value pairs that vary by beacon type. A pageview might include scroll depth data; an event might include a clicked element's selector. MySQL 8.0's native JSON type validates the JSON on write and supports indexed virtual columns if you need to query specific keys later.
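If a dashboard later needs to filter or aggregate on one specific payload key, MySQL can index a virtual generated column extracted from the JSON. The scroll_depth key below is purely illustrative; nothing in the schema requires it:

-- Hypothetical: expose payload->>'$.scroll_depth' as an indexable column
ALTER TABLE pageviews
  ADD COLUMN scroll_depth INT UNSIGNED
    GENERATED ALWAYS AS (CAST(payload->>'$.scroll_depth' AS UNSIGNED)) VIRTUAL,
  ADD INDEX idx_pv_scroll_depth (scroll_depth);

-- Queries on the extracted value can now use the index instead of scanning JSON
SELECT AVG(scroll_depth) FROM pageviews WHERE type = 'pageview';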
INDEX idx_pv_url (url(255)),
INDEX idx_pv_type (type),
INDEX idx_pv_server_timestamp (server_timestamp),
INDEX idx_pv_session_id (session_id)
The url(255) is a prefix index — MySQL indexes only the first 255 characters of the URL. This is a deliberate tradeoff: the index is smaller and faster, and most URL filtering queries match on the path portion (well under 255 characters). InnoDB limits index key length to 3072 bytes; indexing a full VARCHAR(2048) in utf8mb4 (4 bytes per character) would require 8192 bytes, which exceeds that limit.
url(255) works well for WHERE url LIKE '/about%' filters, but it cannot enforce uniqueness on the full URL, and MySQL must read the row to confirm an exact match beyond the first 255 characters. For our analytics use case of filtering and grouping, prefix indexes are ideal.
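Assembled from the columns and indexes above, the pageviews definition looks roughly like the sketch below; treat schema.sql as the source of truth for any NOT NULL constraints or defaults not spelled out in the walkthrough:

CREATE TABLE IF NOT EXISTS pageviews (
  id INT AUTO_INCREMENT PRIMARY KEY,
  url VARCHAR(2048),
  type ENUM('pageview','event','error','performance'),
  user_agent VARCHAR(512),
  viewport_width SMALLINT UNSIGNED,
  viewport_height SMALLINT UNSIGNED,
  referrer VARCHAR(2048),
  client_timestamp DATETIME,
  server_timestamp DATETIME NOT NULL,
  client_ip VARCHAR(45),
  session_id VARCHAR(64),
  payload JSON,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  INDEX idx_pv_url (url(255)),
  INDEX idx_pv_type (type),
  INDEX idx_pv_server_timestamp (server_timestamp),
  INDEX idx_pv_session_id (session_id)
) ENGINE=InnoDB;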
The events table stores custom events: button clicks, scroll milestones, form submissions, video plays, or any interaction you want to track beyond simple pageviews.
| Column | Type | Purpose |
|---|---|---|
| id | INT AUTO_INCREMENT | Primary key |
| session_id | VARCHAR(64) | Links the event to a session |
| event_name | VARCHAR(128) NOT NULL | What happened: "click", "scroll_50", "form_submit" |
| event_category | VARCHAR(128) | Grouping label: "navigation", "engagement", "conversion" |
| event_data | JSON | Arbitrary metadata: {"button": "signup", "position": "header"} |
| url | VARCHAR(2048) | Page where the event occurred |
| server_timestamp | DATETIME NOT NULL | When the server received the event |
| created_at | TIMESTAMP DEFAULT CURRENT_TIMESTAMP | Row creation time |
The combination of event_name and event_category follows the same pattern used by Google Analytics. The event_name describes the specific action, while event_category groups related actions together for reporting.
INDEX idx_ev_event_name (event_name),
INDEX idx_ev_session_id (session_id),
INDEX idx_ev_server_timestamp (server_timestamp)
The index on event_name accelerates queries like SELECT COUNT(*) FROM events WHERE event_name = 'click'. The server_timestamp index supports time-range filtering, which is the most common query pattern in analytics dashboards (e.g., "show me events from the last 7 days").
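Here is the kind of dashboard query that combination of indexes serves; the 7-day window is just an example parameter:

-- Daily event counts for the last 7 days, range-scanned via idx_ev_server_timestamp
SELECT DATE(server_timestamp) AS day,
       event_name,
       COUNT(*) AS occurrences
FROM events
WHERE server_timestamp >= NOW() - INTERVAL 7 DAY
GROUP BY day, event_name
ORDER BY day, occurrences DESC;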
The errors table captures JavaScript errors reported by the collector's error-tracking module. Each row represents one uncaught exception or window.onerror event.
| Column | Type | Purpose |
|---|---|---|
| id | INT AUTO_INCREMENT | Primary key |
| session_id | VARCHAR(64) | Links the error to a session |
| error_message | VARCHAR(1024) NOT NULL | The error message string (e.g., "Cannot read properties of undefined") |
| error_source | VARCHAR(2048) | Script URL that generated the error |
| error_line | INT UNSIGNED | Line number in the source file |
| error_column | INT UNSIGNED | Column number in the source file |
| stack_trace | TEXT | Full stack trace string |
| url | VARCHAR(2048) | Page URL where the error occurred |
| user_agent | VARCHAR(512) | Browser user-agent (useful for browser-specific bugs) |
| server_timestamp | DATETIME NOT NULL | When the server received the error report |
| created_at | TIMESTAMP DEFAULT CURRENT_TIMESTAMP | Row creation time |
Why TEXT for stack_trace instead of VARCHAR? Stack traces can be extremely long, especially with deeply nested call stacks or minified code where the entire bundle appears in the trace. A TEXT column holds up to 65,535 bytes yet contributes only a few bytes to MySQL's 65,535-byte limit on total row size, whereas every VARCHAR column's declared maximum counts against that limit in full. Since this table already has several large VARCHAR columns (url, error_source, user_agent), using TEXT for stack_trace avoids hitting the row-size limit.
TEXT values can also be stored off-page in InnoDB (with the default DYNAMIC row format, only a 20-byte pointer remains in the row once a value moves off-page), which keeps the table compact for queries that do not need the full stack trace.
INDEX idx_err_message (error_message(255)),
INDEX idx_err_session_id (session_id),
INDEX idx_err_server_timestamp (server_timestamp)
The prefix index error_message(255) is critical. Error messages often start with the same pattern ("TypeError: Cannot read..."), so indexing the first 255 characters captures enough to accelerate grouping by error type. You cannot index a full VARCHAR(1024) in utf8mb4 without exceeding InnoDB's key length limit.
The performance table stores both Navigation Timing API metrics and Core Web Vitals. This data comes from the collector's performance-timing and web-vitals modules.
| Column | Type | Purpose |
|---|---|---|
| id | INT AUTO_INCREMENT | Primary key |
| session_id | VARCHAR(64) | Links to the session |
| url | VARCHAR(2048) NOT NULL | Page that was measured |
| ttfb | INT UNSIGNED | Time to First Byte in milliseconds |
| dom_content_loaded | INT UNSIGNED | DOMContentLoaded event time in ms |
| dom_complete | INT UNSIGNED | domComplete time in ms |
| load_time | INT UNSIGNED | loadEventEnd time in ms |
| lcp | DECIMAL(10,2) | Largest Contentful Paint in ms |
| cls | DECIMAL(8,4) | Cumulative Layout Shift (unitless score) |
| inp | DECIMAL(10,2) | Interaction to Next Paint in ms |
| fcp | DECIMAL(10,2) | First Contentful Paint in ms |
| transfer_size | INT UNSIGNED | Total transfer size in bytes |
| resource_count | SMALLINT UNSIGNED | Number of resources loaded |
| server_timestamp | DATETIME NOT NULL | When the server received the data |
| created_at | TIMESTAMP DEFAULT CURRENT_TIMESTAMP | Row creation time |
Why DECIMAL for Web Vitals? Navigation Timing values (ttfb, dom_content_loaded, dom_complete, load_time) are whole-millisecond integers, so INT UNSIGNED is the right type. Web Vitals, however, are reported with fractional precision: an LCP of 2345.67 ms, a CLS of 0.1042, an INP of 128.50 ms, an FCP of 1823.40 ms. DECIMAL(10,2) keeps two decimal places for the millisecond metrics, and DECIMAL(8,4) keeps four for CLS, matching the precision reported by the Web Vitals library. Using DECIMAL instead of FLOAT avoids floating-point rounding errors. When you compute percentiles (p75, p95) for performance dashboards, even small rounding errors can accumulate across thousands of rows. DECIMAL stores exact values.
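MySQL 8.0 has no built-in percentile aggregate, so those percentile queries are usually written with window functions. A nearest-rank sketch of p75 LCP per page, assuming only the columns described above:

-- p75 Largest Contentful Paint per URL (nearest-rank method)
WITH ranked AS (
  SELECT url, lcp,
         ROW_NUMBER() OVER (PARTITION BY url ORDER BY lcp) AS rn,
         COUNT(*)     OVER (PARTITION BY url)              AS n
  FROM performance
  WHERE lcp IS NOT NULL
)
SELECT url, MIN(lcp) AS p75_lcp
FROM ranked
WHERE rn >= CEIL(0.75 * n)
GROUP BY url;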
The sessions table holds aggregated visit data. Each row represents one user session as determined by the sessionization module. This is a derived table — it is populated by server-side logic that groups pageviews using a 30-minute inactivity timeout.
| Column | Type | Purpose |
|---|---|---|
| id | INT AUTO_INCREMENT | Primary key |
| session_id | VARCHAR(64) NOT NULL | Client-generated session identifier |
| first_page | VARCHAR(2048) NOT NULL | Landing page (entry point of the visit) |
| last_page | VARCHAR(2048) NOT NULL | Most recent page viewed |
| page_count | INT UNSIGNED NOT NULL DEFAULT 1 | Total pages in the session |
| start_time | DATETIME NOT NULL | When the session began |
| last_activity | DATETIME NOT NULL | Timestamp of the most recent pageview |
| duration_seconds | INT UNSIGNED NOT NULL DEFAULT 0 | Elapsed time from first to last pageview |
| referrer | VARCHAR(2048) | First-touch referrer (traffic source) |
| user_agent | VARCHAR(512) | Browser user-agent string |
| created_at | TIMESTAMP DEFAULT CURRENT_TIMESTAMP | Row creation time |
The sessions table is updated in place as new pageviews arrive. When a pageview comes in, the sessionization logic checks whether an active session exists for that session_id. If the gap since last_activity is under 30 minutes, the existing row is updated (incrementing page_count, updating last_page and last_activity, recalculating duration_seconds). If the gap exceeds 30 minutes, a new row is inserted.
A single session_id can therefore produce multiple rows in this table. If a user visits at 10:00 AM, leaves, and returns at 3:00 PM, the same session_id maps to two session rows with different start_time values.
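The update-or-insert flow might look roughly like the sketch below; the real logic lives in the server-side sessionization module, and the ? placeholders stand for fields of the incoming pageview:

-- Try to extend an active session (last activity under 30 minutes ago)
UPDATE sessions
SET page_count       = page_count + 1,
    last_page        = ?,                                   -- new pageview URL
    last_activity    = ?,                                   -- new server_timestamp
    duration_seconds = TIMESTAMPDIFF(SECOND, start_time, ?)
WHERE session_id = ?
  AND TIMESTAMPDIFF(MINUTE, last_activity, ?) < 30;         -- ? = new server_timestamp

-- If no row was updated, the gap exceeded 30 minutes: start a new session
INSERT INTO sessions (session_id, first_page, last_page, start_time, last_activity, referrer, user_agent)
VALUES (?, ?, ?, ?, ?, ?, ?);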
The users table stores dashboard login accounts. These are not the visitors being tracked — they are the people who log in to the analytics dashboard to view reports.
| Column | Type | Purpose |
|---|---|---|
| id | INT AUTO_INCREMENT | Primary key |
| email | VARCHAR(255) NOT NULL UNIQUE | Login identifier. UNIQUE constraint prevents duplicate accounts. |
| password_hash | VARCHAR(255) NOT NULL | Bcrypt hash of the password. Never store plain text. |
| display_name | VARCHAR(128) NOT NULL | Name shown in the dashboard UI |
| role | ENUM('owner','admin','viewer') NOT NULL DEFAULT 'viewer' | Permission level |
| created_at | TIMESTAMP DEFAULT CURRENT_TIMESTAMP | When the account was created |
| last_login | DATETIME | Most recent successful login time |
Why VARCHAR(255) for password_hash? Bcrypt hashes are always 60 characters. So why 255? Because password hashing algorithms evolve. If you later migrate to Argon2id (which produces longer hashes), a 255-character column accommodates the change without an ALTER TABLE. It is a forward-compatibility buffer at negligible storage cost.
Role ENUM: The three roles (owner, admin, viewer) define a simple permission hierarchy, with owner at the top and viewer, the least-privileged default, at the bottom.
The password_hash column stores the output of bcrypt (in PHP: password_hash(); in Node.js: bcrypt.hash()). A bcrypt hash looks like $2b$10$N9qo8uLOickgx2ZMRZoMy..., where the $2b$10$ prefix identifies the bcrypt algorithm and a cost factor of 10. Never store raw passwords.
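For reference, a seed-style INSERT for a dashboard account could look like this; the email and name are hypothetical, and the hash placeholder stands for a real bcrypt digest generated by your application, never written by hand:

-- Hypothetical dashboard account (hash truncated; generate it with bcrypt, not in SQL)
INSERT INTO users (email, password_hash, display_name, role)
VALUES ('admin@example.com', '$2b$10$N9qo8uLOickgx2ZMRZoMy...', 'Site Admin', 'admin');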
The seed.sql file contains INSERT statements that populate all 6 tables with realistic test data. To load it:
# First, create the schema
mysql -u root -p < schema.sql
# Then load seed data
mysql -u root -p analytics < seed.sql
After loading, verify that data was inserted correctly:
USE analytics;
-- ROWS is a reserved word in MySQL 8.0, so the alias needs backticks
SELECT 'pageviews' AS tbl, COUNT(*) AS `rows` FROM pageviews
UNION ALL
SELECT 'events', COUNT(*) FROM events
UNION ALL
SELECT 'errors', COUNT(*) FROM errors
UNION ALL
SELECT 'performance', COUNT(*) FROM performance
UNION ALL
SELECT 'sessions', COUNT(*) FROM sessions
UNION ALL
SELECT 'users', COUNT(*) FROM users;
Expected output:
+-------------+------+
| tbl | rows |
+-------------+------+
| pageviews | 30 |
| events | 10 |
| errors | 8 |
| performance | 15 |
| sessions | 5 |
| users | 3 |
+-------------+------+
You can also spot-check individual tables:
-- Check pageview distribution by type
SELECT type, COUNT(*) AS cnt
FROM pageviews
GROUP BY type;
-- Check session durations
SELECT session_id, page_count, duration_seconds,
first_page, last_page
FROM sessions
ORDER BY start_time;
Indexes only help if the query planner actually uses them. The EXPLAIN statement shows you the execution plan MySQL will use for a query — which indexes it selects, how many rows it estimates scanning, and whether it needs a full table scan.
Let us verify that our indexes work on a common analytics query — counting pageviews by day for the last 7 days:
EXPLAIN SELECT
DATE(server_timestamp) AS day,
COUNT(*) AS views
FROM pageviews
WHERE server_timestamp >= NOW() - INTERVAL 7 DAY
GROUP BY day
ORDER BY day;
Expected output (simplified):
+----+-------------+-----------+-------+---------------------------+---------------------------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+---------------------------+---------------------------+---------+------+------+-------------+
| 1 | SIMPLE | pageviews | range | idx_pv_server_timestamp | idx_pv_server_timestamp | 5 | NULL | 15 | Using where |
+----+-------------+-----------+-------+---------------------------+---------------------------+---------+------+------+-------------+
Key things to look for in the EXPLAIN output:
| Column | What You Want to See |
|---|---|
| type | range, ref, or eq_ref — not ALL (full table scan) |
| key | The name of your index (e.g., idx_pv_server_timestamp) |
| rows | A number much smaller than the total row count |
| Extra | Using where is fine; Using filesort on large tables is a warning sign |
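To see the contrast, run EXPLAIN on a filter with no supporting index; viewport_width is not indexed in this schema, so the plan should report type = ALL, a full table scan:

-- No index on viewport_width, so MySQL scans every row
EXPLAIN SELECT COUNT(*) FROM pageviews WHERE viewport_width < 400;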
Try EXPLAIN on a few more queries to confirm index usage:
-- Session lookup (should use idx_ev_session_id)
EXPLAIN SELECT * FROM events WHERE session_id = 'sess_abc123';
-- Error grouping (should use idx_err_message)
EXPLAIN SELECT error_message, COUNT(*) AS cnt
FROM errors
GROUP BY error_message
ORDER BY cnt DESC
LIMIT 10;
-- Performance by URL (should use idx_perf_url)
EXPLAIN SELECT url, AVG(lcp) AS avg_lcp, AVG(cls) AS avg_cls
FROM performance
GROUP BY url;
If possible_keys lists your index, MySQL knows the index exists and could use it; the key column tells you whether the optimizer actually chose it.