Analytics data is only useful if it is stored in a structure that makes querying fast and reliable. This section covers MySQL schema design for analytics — tables, indexes, data types, and retention strategies.
A bad schema means slow queries, wasted disk, and frustrated dashboard users. Analytics queries are read-heavy and aggregate-heavy. The schema must optimize for GROUP BY, date ranges, and COUNT/AVG operations.
Consider the difference between a well-designed analytics schema and a naive one. A single pageviews table with no indexes on server_timestamp will require a full table scan every time someone loads the dashboard for "page views in the last 7 days." With a million rows, that query takes seconds. With ten million rows, it takes minutes. With proper indexes and partitioning, it takes milliseconds regardless of table size.
The schema decisions you make at the start of a project are among the hardest to change later. Adding an index is easy. Splitting a table, changing a primary key, or migrating from VARCHAR to ENUM requires downtime, data migration scripts, and careful testing. Get it right the first time.
The analytics database uses six tables. Each serves a distinct purpose in the pipeline:
| Table | Purpose |
|---|---|
| `pageviews` | Primary fact table — one row per page load, capturing URL, viewport, referrer, and timestamps |
| `events` | Custom interaction events from collector plugins (clicks, scrolls, form submissions) |
| `errors` | JavaScript errors captured by the collector's error tracking module |
| `performance` | Navigation Timing metrics and Core Web Vitals (LCP, CLS, INP) per page load |
| `sessions` | Aggregated visit data — entry/exit pages, page count, duration |
| `users` | Dashboard login accounts (NOT analytics visitors) — authentication and role management |
The pageviews table is the primary fact table — the central table that most dashboard queries will hit. Every page load produces exactly one row here. It captures the data the collector sends in its initial beacon plus the enrichment fields added by the server.
| Column | Type | Why It Exists |
|---|---|---|
| `id` | INT UNSIGNED AUTO_INCREMENT | Surrogate primary key for fast lookups and JOINs |
| `url` | VARCHAR(2048) | The page URL visited — the most queried column for "top pages" reports |
| `type` | VARCHAR(32) | Beacon type identifier (pageview, spa_navigation) for filtering event categories |
| `user_agent` | VARCHAR(512) | Raw User-Agent string from the server (not client-reported) for browser/OS breakdowns |
| `viewport_width` | SMALLINT UNSIGNED | Browser viewport width in pixels — used for device-type analysis (mobile vs desktop) |
| `viewport_height` | SMALLINT UNSIGNED | Browser viewport height in pixels — paired with width for screen-size distributions |
| `referrer` | VARCHAR(2048) | The referring URL — where the visitor came from. Empty string for direct visits |
| `client_timestamp` | BIGINT | Unix epoch in milliseconds from the client — may be inaccurate but useful for clock-drift analysis |
| `server_timestamp` | DATETIME | Authoritative timestamp set by the server at beacon receipt — the column you query for date ranges |
| `client_ip` | VARCHAR(45) | IPv4 or IPv6 address for geolocation lookups (45 chars covers the longest IPv6 representation) |
| `session_id` | VARCHAR(36) | UUID linking this pageview to a session record — the JOIN key for session-level queries |
| `payload` | JSON | Catch-all for extra data the collector sends that does not map to a dedicated column — avoids schema changes for new fields |
| `created_at` | DATETIME DEFAULT CURRENT_TIMESTAMP | Row insertion time for auditing — usually identical to server_timestamp but decoupled for clarity |
```sql
CREATE TABLE pageviews (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  url VARCHAR(2048) NOT NULL,
  type VARCHAR(32) NOT NULL DEFAULT 'pageview',
  user_agent VARCHAR(512),
  viewport_width SMALLINT UNSIGNED,
  viewport_height SMALLINT UNSIGNED,
  referrer VARCHAR(2048),
  client_timestamp BIGINT,
  server_timestamp DATETIME NOT NULL,
  client_ip VARCHAR(45),
  session_id VARCHAR(36) NOT NULL,
  payload JSON,
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
  INDEX idx_server_ts (server_timestamp),
  INDEX idx_session (session_id),
  INDEX idx_url_ts (url(255), server_timestamp)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
```
Why url(255) in the index? MySQL InnoDB limits index key prefixes to 3072 bytes. A VARCHAR(2048) column in utf8mb4 would require 8192 bytes, far exceeding the limit. Indexing the first 255 characters covers the vast majority of real-world URLs while staying within bounds.
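To see the limit in action, here is a quick sketch (the index names are illustrative, not part of the schema above):

```sql
-- Fails: 2048 chars x 4 bytes (utf8mb4) = 8192 bytes, over the 3072-byte limit.
-- ERROR 1071: Specified key was too long; max key length is 3072 bytes
CREATE INDEX idx_url_full ON pageviews (url);

-- Works: 255 chars x 4 bytes = 1020 bytes, comfortably within bounds.
CREATE INDEX idx_url_prefix ON pageviews (url(255));
```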
Custom events come from collector plugins — click tracking, scroll depth measurement, form submission monitoring, and any other interaction you want to capture. Each event is a discrete action that happened on a page, linked to a session.
| Column | Type | Purpose |
|---|---|---|
| `id` | INT UNSIGNED AUTO_INCREMENT | Primary key |
| `session_id` | VARCHAR(36) | Links to the session this event belongs to |
| `event_name` | VARCHAR(128) | Machine-readable event identifier (e.g., click, scroll_depth, form_submit) |
| `event_category` | VARCHAR(128) | Grouping label for dashboards (e.g., interaction, engagement, conversion) |
| `event_data` | JSON | Flexible payload — click coordinates, scroll percentage, form field count, etc. |
| `url` | VARCHAR(2048) | The page where the event occurred |
| `server_timestamp` | DATETIME | Authoritative time of event receipt |
| `created_at` | DATETIME DEFAULT CURRENT_TIMESTAMP | Row insertion time |
```sql
CREATE TABLE events (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  session_id VARCHAR(36) NOT NULL,
  event_name VARCHAR(128) NOT NULL,
  event_category VARCHAR(128),
  event_data JSON,
  url VARCHAR(2048),
  server_timestamp DATETIME NOT NULL,
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
  INDEX idx_session (session_id),
  INDEX idx_event_name (event_name),
  INDEX idx_server_ts (server_timestamp)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
```
The event_data JSON column is the key design decision here. Rather than creating separate columns for every possible event payload (click x/y coordinates, scroll depth percentage, form field name), a JSON column keeps the schema stable while allowing each event type to carry different data. Dashboard queries can use MySQL's JSON_EXTRACT() function to pull specific values when needed.
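As an illustration, a query like the following computes average scroll depth per page — a sketch that assumes the scroll plugin stores a payload shaped like `{"scroll_pct": 75}` (the key name is an assumption, not defined by the schema):

```sql
-- Average scroll depth per page since the start of the year.
-- ->> extracts and unquotes; CAST turns the string into a number for AVG.
SELECT url,
       AVG(CAST(event_data->>'$.scroll_pct' AS UNSIGNED)) AS avg_scroll_pct
FROM events
WHERE event_name = 'scroll_depth'
  AND server_timestamp > '2024-01-01'
GROUP BY url;
```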
JavaScript errors captured by the collector's error tracking module. Each row represents a single error occurrence on a visitor's browser. This table powers the "Errors" section of the dashboard, showing error frequency, affected pages, and browser distributions.
| Column | Type | Purpose |
|---|---|---|
| `id` | INT UNSIGNED AUTO_INCREMENT | Primary key |
| `session_id` | VARCHAR(36) | Links error to a session for per-visit error analysis |
| `error_message` | VARCHAR(1024) | The error message (e.g., TypeError: Cannot read properties of null) |
| `error_source` | VARCHAR(2048) | Script URL where the error originated |
| `error_line` | INT UNSIGNED | Line number in the source file |
| `error_column` | INT UNSIGNED | Column number in the source file |
| `stack_trace` | TEXT | Full stack trace — can be very long, so TEXT instead of VARCHAR |
| `url` | VARCHAR(2048) | The page the user was on when the error occurred |
| `user_agent` | VARCHAR(512) | Browser identification for "which browsers produce this error?" analysis |
| `server_timestamp` | DATETIME | Authoritative time of error receipt |
| `created_at` | DATETIME DEFAULT CURRENT_TIMESTAMP | Row insertion time |
```sql
CREATE TABLE errors (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  session_id VARCHAR(36) NOT NULL,
  error_message VARCHAR(1024),
  error_source VARCHAR(2048),
  error_line INT UNSIGNED,
  error_column INT UNSIGNED,
  stack_trace TEXT,
  url VARCHAR(2048),
  user_agent VARCHAR(512),
  server_timestamp DATETIME NOT NULL,
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
  INDEX idx_session (session_id),
  INDEX idx_server_ts (server_timestamp),
  INDEX idx_message (error_message(255))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
```
Why duplicate user_agent in the errors table? Many JavaScript errors are browser-specific. A TypeError that only appears in Safari 16 or a ReferenceError caused by a missing API in older Firefox builds becomes immediately visible when you can group errors by User-Agent. Duplicating this column avoids a JOIN to the pageviews table for the most common error-analysis queries.
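A typical error-analysis query might look like this sketch (matching on a message prefix is an assumption; real deployments often normalize messages first):

```sql
-- Which browsers produce a given error? Group a week of occurrences by User-Agent.
SELECT user_agent, COUNT(*) AS occurrences
FROM errors
WHERE error_message LIKE 'TypeError: Cannot read properties of null%'
  AND server_timestamp > NOW() - INTERVAL 7 DAY
GROUP BY user_agent
ORDER BY occurrences DESC
LIMIT 10;
```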
Navigation Timing metrics and Core Web Vitals collected by the performance timing and web vitals modules. Each row corresponds to a single page load and captures the full timing breakdown.
| Column | Type | Purpose |
|---|---|---|
| `id` | INT UNSIGNED AUTO_INCREMENT | Primary key |
| `session_id` | VARCHAR(36) | Links to session for per-visit performance analysis |
| `url` | VARCHAR(2048) | The page whose performance was measured |
| `ttfb` | FLOAT | Time to First Byte in ms — server responsiveness |
| `dom_content_loaded` | FLOAT | DOMContentLoaded event time in ms — HTML parsed, DOM ready |
| `dom_complete` | FLOAT | DOM complete time in ms — all subresources finished loading |
| `load_time` | FLOAT | Full page load time in ms (load event fired) |
| `lcp` | FLOAT | Largest Contentful Paint in ms — Core Web Vital |
| `cls` | FLOAT | Cumulative Layout Shift (unitless) — Core Web Vital |
| `inp` | FLOAT | Interaction to Next Paint in ms — Core Web Vital |
| `fcp` | FLOAT | First Contentful Paint in ms — first visible content rendered |
| `transfer_size` | INT UNSIGNED | Total bytes transferred for the page and its resources |
| `resource_count` | SMALLINT UNSIGNED | Number of resources loaded (scripts, stylesheets, images, etc.) |
| `server_timestamp` | DATETIME | Authoritative time of beacon receipt |
| `created_at` | DATETIME DEFAULT CURRENT_TIMESTAMP | Row insertion time |
```sql
CREATE TABLE performance (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  session_id VARCHAR(36) NOT NULL,
  url VARCHAR(2048),
  ttfb FLOAT,
  dom_content_loaded FLOAT,
  dom_complete FLOAT,
  load_time FLOAT,
  lcp FLOAT,
  cls FLOAT,
  inp FLOAT,
  fcp FLOAT,
  transfer_size INT UNSIGNED,
  resource_count SMALLINT UNSIGNED,
  server_timestamp DATETIME NOT NULL,
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
  INDEX idx_session (session_id),
  INDEX idx_server_ts (server_timestamp),
  INDEX idx_url_ts (url(255), server_timestamp)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
```
All timing columns use FLOAT because fractional precision matters for Web Vitals (CLS is unitless, and values like 0.042 are common). FLOAT provides adequate precision for timing values while using only 4 bytes per column — half the size of DOUBLE.
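For example, a per-page Web Vitals summary might look like this sketch — note that the CLS average needs three decimal places while the millisecond metrics round to one:

```sql
-- Core Web Vitals averages per page over the last 30 days.
SELECT url,
       ROUND(AVG(lcp), 1) AS avg_lcp_ms,
       ROUND(AVG(inp), 1) AS avg_inp_ms,
       ROUND(AVG(cls), 3) AS avg_cls
FROM performance
WHERE server_timestamp > NOW() - INTERVAL 30 DAY
GROUP BY url
ORDER BY avg_lcp_ms DESC;
```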
The sessions table stores aggregated visit data computed by the sessionization module. Rather than recalculating session-level metrics from raw pageviews every time the dashboard loads, this table pre-computes them. It is updated incrementally as new beacons arrive for an active session.
| Column | Type | Purpose |
|---|---|---|
| `session_id` | VARCHAR(36) PRIMARY KEY | UUID generated at session start — the primary key, not auto-increment |
| `first_page` | VARCHAR(2048) | Landing page URL — where the visitor entered the site |
| `last_page` | VARCHAR(2048) | Exit page URL — the last page visited (updated with each beacon) |
| `page_count` | SMALLINT UNSIGNED | Number of pages viewed in this session (incremented per pageview beacon) |
| `start_time` | DATETIME | Timestamp of the first beacon in this session |
| `last_activity` | DATETIME | Timestamp of the most recent beacon (used for 30-minute timeout logic) |
| `duration_seconds` | INT UNSIGNED | Elapsed time from first to last beacon — computed as TIMESTAMPDIFF(SECOND, start_time, last_activity) |
| `referrer` | VARCHAR(2048) | Referrer from the first beacon — traffic source attribution |
| `user_agent` | VARCHAR(512) | User-Agent from the first beacon — device/browser identification |
```sql
CREATE TABLE sessions (
  session_id VARCHAR(36) PRIMARY KEY,
  first_page VARCHAR(2048),
  last_page VARCHAR(2048),
  page_count SMALLINT UNSIGNED DEFAULT 1,
  start_time DATETIME NOT NULL,
  last_activity DATETIME NOT NULL,
  duration_seconds INT UNSIGNED DEFAULT 0,
  referrer VARCHAR(2048),
  user_agent VARCHAR(512),
  INDEX idx_start_time (start_time),
  INDEX idx_last_activity (last_activity)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
```
Notice that session_id is the primary key, not an auto-increment integer. This is because session IDs are generated by the server-processing pipeline (as UUIDs) and may be referenced by other tables before the session row even exists. Using the UUID directly as the PK avoids a lookup to find the auto-increment ID.
When a beacon arrives, the pipeline performs an upsert: if the session_id already exists, update last_page, last_activity, page_count, and duration_seconds; if it does not exist, insert a new row. A minimal sketch follows.
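One way to express that upsert is INSERT ... ON DUPLICATE KEY UPDATE (a sketch; the per-beacon statement and placeholder bind values are assumptions about the pipeline, not part of the schema):

```sql
-- Insert a session row on the first beacon; update it on every later one.
INSERT INTO sessions (session_id, first_page, last_page, start_time, last_activity)
VALUES (?, ?, ?, NOW(), NOW())
ON DUPLICATE KEY UPDATE
  last_page        = VALUES(last_page),   -- VALUES() is deprecated in 8.0.20+ in favor of row aliases, but still works
  last_activity    = NOW(),
  page_count       = page_count + 1,
  duration_seconds = TIMESTAMPDIFF(SECOND, start_time, NOW());
```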
This table has nothing to do with analytics visitors. It stores dashboard login accounts — the people who log into your analytics dashboard to view reports. It is a standard authentication table with role-based access control.
| Column | Type | Purpose |
|---|---|---|
| `id` | INT UNSIGNED AUTO_INCREMENT | Primary key |
| `email` | VARCHAR(255) UNIQUE | Login identifier — must be unique across all accounts |
| `password_hash` | VARCHAR(255) | Bcrypt or Argon2 hash — never store plaintext passwords |
| `display_name` | VARCHAR(128) | Human-readable name shown in the dashboard UI |
| `role` | ENUM('owner','admin','viewer') | Access level: owner has full control, admin can manage settings, viewer is read-only |
| `created_at` | DATETIME DEFAULT CURRENT_TIMESTAMP | Account creation time |
| `last_login` | DATETIME | Last successful login — useful for auditing and identifying inactive accounts |
```sql
CREATE TABLE users (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  email VARCHAR(255) NOT NULL UNIQUE,
  password_hash VARCHAR(255) NOT NULL,
  display_name VARCHAR(128),
  role ENUM('owner', 'admin', 'viewer') NOT NULL DEFAULT 'viewer',
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
  last_login DATETIME
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
```
The password_hash column stores the output of password_hash() in PHP or bcrypt.hash() in Node.js. Never store plaintext passwords, and never use MD5 or SHA-256 for password hashing — those are fast hashes designed for integrity checks, not password storage. Use bcrypt, scrypt, or Argon2, which are intentionally slow to resist brute-force attacks.
Without indexes, every query performs a full table scan. For a pageviews table with millions of rows, that means reading every row from disk to answer a simple question like "how many page views happened yesterday." Indexes are the single most important performance optimization for analytics databases.
An index is a sorted copy of one or more columns stored in a B-tree structure. When MySQL receives a query with a WHERE clause on an indexed column, it traverses the B-tree to find matching rows in O(log n) time instead of scanning all n rows. For a table with 10 million rows, that is the difference between examining roughly 23 index entries (log2 of 10 million) and reading all 10 million rows.
A covering index contains all the columns a query needs, so MySQL can answer the query entirely from the index without touching the table data. This is the fastest possible query execution.
```sql
-- This query needs: server_timestamp (WHERE), url (SELECT), COUNT (aggregate).
-- A composite index on (server_timestamp, url(255)) serves the filter and grouping:
CREATE INDEX idx_ts_url ON pageviews (server_timestamp, url(255));

SELECT url, COUNT(*) AS views
FROM pageviews
WHERE server_timestamp BETWEEN '2024-01-01' AND '2024-01-07'
GROUP BY url
ORDER BY views DESC;
```

One caveat: because the index stores only a 255-character prefix of url, MySQL must still visit rows to read full URL values, so this index is not strictly covering — the dramatic win comes from the range scan on server_timestamp.
| Table | Index | Supports |
|---|---|---|
| `pageviews` | (server_timestamp) | Date-range queries, "page views per day" charts |
| `pageviews` | (session_id) | Session-level aggregation, "pages per session" queries |
| `pageviews` | (url(255), server_timestamp) | Per-page traffic over time, "top pages this week" |
| `events` | (event_name) | "Click count by event type" queries |
| `events` | (server_timestamp) | Event volume over time |
| `errors` | (server_timestamp) | Error rate over time |
| `errors` | (error_message(255)) | Grouping errors by message for "top errors" report |
| `performance` | (server_timestamp) | Performance trends over time |
| `performance` | (url(255), server_timestamp) | Per-page performance breakdown |
| `sessions` | (start_time) | Session count per day, "unique visitors" approximation |
Always verify that MySQL actually uses your indexes. The EXPLAIN statement shows the query execution plan:
```sql
EXPLAIN SELECT url, COUNT(*) AS views
FROM pageviews
WHERE server_timestamp BETWEEN '2024-01-01' AND '2024-01-07'
GROUP BY url;

-- Look for:
--   type: range        (using the index for a range scan — good)
--   key: idx_server_ts (your index is being used)
--   rows: 8432         (estimated rows examined; should be << total rows)
--
-- Red flags:
--   type: ALL          (full table scan, no index used)
--   key: NULL          (no index selected)
--   rows: 5000000      (examining most of the table)
```
Indexes are not free: every INSERT must update every index on the table. But for analytics workloads, where data is written once and read many times, the tradeoff is overwhelmingly worth it. Do not be afraid to add indexes on analytics tables.
In a traditional application database, you would normalize aggressively: extract user_agent strings into a separate user_agents table with a foreign key, store URLs in a urls table, and reference them by ID. This eliminates data duplication and keeps the database compact.
Analytics schemas take a different approach. They are often intentionally denormalized for read performance.
| Consideration | Normalized | Denormalized |
|---|---|---|
| Query complexity | JOINs required for every dashboard query | Single-table queries, no JOINs |
| Query speed | Slower — JOINs add latency, especially with millions of rows | Faster — data is already in the row |
| Disk usage | Smaller — shared strings stored once | Larger — strings duplicated per row |
| Write speed | Slower — may need INSERT into multiple tables per beacon | Faster — single INSERT per beacon |
| Schema maintenance | More tables, more foreign keys, more complexity | Fewer tables, simpler schema |
Normalization still makes sense in specific cases: if duplicated user_agent strings come to dominate disk usage, for example, moving them into a user_agents table with a SMALLINT FK can save significant disk space.

Choosing the right MySQL data type for each column affects storage efficiency, query performance, and data integrity. Here are the key decisions for an analytics schema:
| Type | Max Size | Indexable | Use When |
|---|---|---|---|
| VARCHAR(n) | 65,535 bytes | Yes (with prefix length for long columns) | Bounded-length strings: URLs, user agents, error messages |
| TEXT | 65,535 bytes | Only with prefix index | Unbounded strings: stack traces, raw payloads |
Use VARCHAR when you know the maximum length and want to enforce it at the database level. Use TEXT for columns like stack_trace where the content length is unpredictable and you do not need to index the full value.
| Type | Size | Max Value | Use When |
|---|---|---|---|
| INT UNSIGNED | 4 bytes | 4,294,967,295 (~4.3 billion) | Auto-increment PKs, counts, most analytics IDs |
| BIGINT | 8 bytes | ~9.2 quintillion (signed) | Unix timestamps in milliseconds, very high-volume counters |
| SMALLINT UNSIGNED | 2 bytes | 65,535 | Viewport dimensions, resource counts, page counts |
The client_timestamp column uses BIGINT because JavaScript's Date.now() returns milliseconds since epoch (values like 1706745600000), which exceeds INT's 4.3 billion maximum.
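One use of the raw millisecond value is a clock-drift check. A sketch (it assumes the session time_zone is UTC so that FROM_UNIXTIME and server_timestamp line up):

```sql
-- Average client/server clock drift: convert the ms epoch to DATETIME, then diff.
SELECT AVG(TIMESTAMPDIFF(SECOND,
           FROM_UNIXTIME(client_timestamp / 1000),
           server_timestamp)) AS avg_drift_seconds
FROM pageviews
WHERE client_timestamp IS NOT NULL;
```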
| Type | Range | Storage | Timezone Behavior |
|---|---|---|---|
| DATETIME | 1000-01-01 to 9999-12-31 | 5 bytes (MySQL 5.6.4+) | Stored as-is, no conversion |
| TIMESTAMP | 1970-01-01 to 2038-01-19 | 4 bytes | Stored as UTC, converted to session timezone on read |
Use DATETIME for analytics timestamps. The TIMESTAMP type's 2038 limit is a real concern for data you plan to retain long-term, and the automatic timezone conversion can cause confusion when querying across timezones. Store everything in UTC and handle timezone conversion in the dashboard application layer.
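If you do need a timezone view inside SQL rather than in the application layer, CONVERT_TZ can do it at read time — this assumes the MySQL timezone tables are loaded (named zones fail otherwise):

```sql
-- Render UTC timestamps in a viewer's local timezone at query time.
SELECT CONVERT_TZ(server_timestamp, 'UTC', 'America/New_York') AS local_time
FROM pageviews
ORDER BY server_timestamp DESC
LIMIT 5;
```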
MySQL 5.7+ supports a native JSON column type. It validates that inserted data is valid JSON, supports indexing via generated columns, and provides functions like JSON_EXTRACT(), JSON_UNQUOTE(), and ->> (shorthand for extract + unquote).
```sql
-- Extract a value from the JSON payload column:
SELECT payload->>'$.browser' AS browser
FROM pageviews
WHERE server_timestamp > '2024-01-01';

-- Create a generated column + index for frequent JSON queries:
ALTER TABLE events
  ADD COLUMN scroll_depth INT
    GENERATED ALWAYS AS (JSON_EXTRACT(event_data, '$.scroll_pct')) VIRTUAL,
  ADD INDEX idx_scroll_depth (scroll_depth);
```
Always use utf8mb4, not utf8. MySQL's utf8 is a legacy alias that only supports 3-byte characters, which cannot store emoji or some CJK characters. utf8mb4 is true UTF-8 with full 4-byte support. User-Agent strings, URLs, and page titles can all contain characters that require 4 bytes.
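If an existing table was created with the legacy charset, conversion is a single statement — note that it rebuilds (and locks) the table while it runs, so schedule it accordingly:

```sql
-- Convert a legacy-utf8 table to full 4-byte UTF-8.
ALTER TABLE pageviews CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
```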
Use ENUM for columns with a small, fixed set of possible values. MySQL stores ENUM values as 1-2 byte integers internally, making them more storage-efficient and type-safe than VARCHAR:
```sql
-- ENUM for user roles (3 possible values, stored as 1 byte each):
role ENUM('owner', 'admin', 'viewer') NOT NULL DEFAULT 'viewer'

-- ENUM for beacon types (if your collector uses a fixed set):
type ENUM('pageview', 'spa_navigation', 'event', 'error', 'performance')
```
As your analytics tables grow beyond millions of rows, even indexed queries can slow down because MySQL must scan a large index tree. RANGE partitioning on the server_timestamp column solves this by physically splitting the table into separate partitions, one per month (or week, or day).
When MySQL receives a query with a WHERE server_timestamp BETWEEN ... clause, it performs partition pruning — it skips partitions that cannot contain matching rows. A query for "last 7 days" on a table with 12 monthly partitions only scans 1 partition instead of all 12. This is automatic and requires no changes to your queries.
```sql
CREATE TABLE pageviews (
  id INT UNSIGNED AUTO_INCREMENT,
  url VARCHAR(2048) NOT NULL,
  type VARCHAR(32) NOT NULL DEFAULT 'pageview',
  user_agent VARCHAR(512),
  viewport_width SMALLINT UNSIGNED,
  viewport_height SMALLINT UNSIGNED,
  referrer VARCHAR(2048),
  client_timestamp BIGINT,
  server_timestamp DATETIME NOT NULL,
  client_ip VARCHAR(45),
  session_id VARCHAR(36) NOT NULL,
  payload JSON,
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (id, server_timestamp),
  INDEX idx_session (session_id),
  INDEX idx_url_ts (url(255), server_timestamp)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
PARTITION BY RANGE (TO_DAYS(server_timestamp)) (
  PARTITION p2024_01 VALUES LESS THAN (TO_DAYS('2024-02-01')),
  PARTITION p2024_02 VALUES LESS THAN (TO_DAYS('2024-03-01')),
  PARTITION p2024_03 VALUES LESS THAN (TO_DAYS('2024-04-01')),
  PARTITION p2024_04 VALUES LESS THAN (TO_DAYS('2024-05-01')),
  PARTITION p2024_05 VALUES LESS THAN (TO_DAYS('2024-06-01')),
  PARTITION p2024_06 VALUES LESS THAN (TO_DAYS('2024-07-01')),
  PARTITION p_future VALUES LESS THAN MAXVALUE
);
```
Note that the primary key is (id, server_timestamp) instead of just (id): MySQL requires every unique key on a partitioned table — including the primary key — to contain all columns used in the partitioning expression.
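To confirm pruning is actually happening, EXPLAIN reports which partitions a query touches (MySQL 5.7+ shows a partitions column by default):

```sql
-- The partitions column should list only p2024_06, not every partition.
EXPLAIN SELECT COUNT(*)
FROM pageviews
WHERE server_timestamp >= '2024-06-01' AND server_timestamp < '2024-06-08';
```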
The biggest operational advantage of partitioning is data retention. When you need to delete old data, dropping a partition is nearly instantaneous, while a DELETE statement on millions of rows is painfully slow and generates enormous redo logs:
| Approach | Operation | Speed | Impact |
|---|---|---|---|
| DELETE | `DELETE FROM pageviews WHERE server_timestamp < '2024-01-01'` | Minutes to hours | Locks rows, generates redo log, fragments the table, requires OPTIMIZE TABLE after |
| DROP PARTITION | `ALTER TABLE pageviews DROP PARTITION p2023_12` | Milliseconds | Drops the partition file instantly — no row-level locking, no fragmentation |
```sql
-- Delete 3 months of data in milliseconds:
ALTER TABLE pageviews DROP PARTITION p2024_01, p2024_02, p2024_03;

-- Add a new partition for next month:
ALTER TABLE pageviews REORGANIZE PARTITION p_future INTO (
  PARTITION p2024_07 VALUES LESS THAN (TO_DAYS('2024-08-01')),
  PARTITION p_future VALUES LESS THAN MAXVALUE
);
```
To put this section into practice, the follow-on modules cover:

- Create the MySQL database and all six tables with proper data types, indexes, and character encoding. Write and execute the full schema DDL.
- Write the SQL queries that power the dashboard: page views over time, top pages, session counts, performance percentiles, and error rankings.
- Implement date-range partitioning, write a cron job to rotate partitions monthly, and configure a retention policy to drop data older than N months.