Storage & Schema Design

Analytics data is only useful if it is stored in a structure that makes querying fast and reliable. This section covers MySQL schema design for analytics — tables, indexes, data types, and retention strategies.

1. Why Schema Matters

A bad schema means slow queries, wasted disk, and frustrated dashboard users. Analytics queries are read-heavy and aggregate-heavy. The schema must optimize for GROUP BY, date ranges, and COUNT/AVG operations.

Consider the difference between a well-designed analytics schema and a naive one. A single pageviews table with no indexes on server_timestamp will require a full table scan every time someone loads the dashboard for "page views in the last 7 days." With a million rows, that query takes seconds. With ten million rows, it takes minutes. With proper indexes and partitioning, it takes milliseconds regardless of table size.

The schema decisions you make at the start of a project are among the hardest to change later. Adding an index is easy. Splitting a table, changing a primary key, or migrating from VARCHAR to ENUM requires downtime, data migration scripts, and careful testing. Get it right the first time.

Analytics schemas are different from application schemas. In a typical web application, the schema is normalized, optimized for writes, and designed around entities (users, orders, products). An analytics schema is optimized for reads, designed around events, and often intentionally denormalized to avoid expensive JOINs at query time.

2. Core Tables Overview

The analytics database uses six tables. Each serves a distinct purpose in the pipeline:

Table       | Purpose
pageviews   | Primary fact table — one row per page load, capturing URL, viewport, referrer, and timestamps
events      | Custom interaction events from collector plugins (clicks, scrolls, form submissions)
errors      | JavaScript errors captured by the collector's error tracking module
performance | Navigation Timing metrics and Core Web Vitals (LCP, CLS, INP) per page load
sessions    | Aggregated visit data — entry/exit pages, page count, duration
users       | Dashboard login accounts (NOT analytics visitors) — authentication and role management

[Schema relationship diagram] The six tables and their shared key: session_id links pageviews, events, errors, and performance to a single session record. The users table is independent — it stores dashboard accounts, not analytics visitors, and has no foreign key to the analytics tables.

3. The pageviews Table

This is the primary fact table — the central table that most dashboard queries will hit. Every page load produces exactly one row here. It captures the data the collector sends in its initial beacon plus the enrichment fields added by the server.

Column           | Type                               | Why It Exists
id               | INT UNSIGNED AUTO_INCREMENT        | Surrogate primary key for fast lookups and JOINs
url              | VARCHAR(2048)                      | The page URL visited — the most queried column for "top pages" reports
type             | VARCHAR(32)                        | Beacon type identifier (pageview, spa_navigation) for filtering event categories
user_agent       | VARCHAR(512)                       | Raw User-Agent string from the server (not client-reported) for browser/OS breakdowns
viewport_width   | SMALLINT UNSIGNED                  | Browser viewport width in pixels — used for device-type analysis (mobile vs desktop)
viewport_height  | SMALLINT UNSIGNED                  | Browser viewport height in pixels — paired with width for screen-size distributions
referrer         | VARCHAR(2048)                      | The referring URL — where the visitor came from. Empty string for direct visits
client_timestamp | BIGINT                             | Unix epoch in milliseconds from the client — may be inaccurate but useful for clock-drift analysis
server_timestamp | DATETIME                           | Authoritative timestamp set by the server at beacon receipt — the column you query for date ranges
client_ip        | VARCHAR(45)                        | IPv4 or IPv6 address for geolocation lookups (45 chars covers the longest IPv6 representation)
session_id       | VARCHAR(36)                        | UUID linking this pageview to a session record — the JOIN key for session-level queries
payload          | JSON                               | Catch-all for extra data the collector sends that does not map to a dedicated column — avoids schema changes for new fields
created_at       | DATETIME DEFAULT CURRENT_TIMESTAMP | Row insertion time for auditing — usually identical to server_timestamp but decoupled for clarity

CREATE TABLE pageviews (
    id              INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    url             VARCHAR(2048) NOT NULL,
    type            VARCHAR(32) NOT NULL DEFAULT 'pageview',
    user_agent      VARCHAR(512),
    viewport_width  SMALLINT UNSIGNED,
    viewport_height SMALLINT UNSIGNED,
    referrer        VARCHAR(2048),
    client_timestamp BIGINT,
    server_timestamp DATETIME NOT NULL,
    client_ip       VARCHAR(45),
    session_id      VARCHAR(36) NOT NULL,
    payload         JSON,
    created_at      DATETIME DEFAULT CURRENT_TIMESTAMP,

    INDEX idx_server_ts (server_timestamp),
    INDEX idx_session   (session_id),
    INDEX idx_url_ts    (url(255), server_timestamp)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Why url(255) in the index? MySQL InnoDB limits index key prefixes to 3072 bytes. A VARCHAR(2048) column in utf8mb4 would require 8192 bytes, far exceeding the limit. Indexing the first 255 characters covers the vast majority of real-world URLs while staying within bounds.
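
The kind of query this composite index is designed for looks something like the sketch below; the '/pricing/' URL is just an illustrative value.

-- Daily views for one page over the last 30 days. The (url, server_timestamp)
-- composite index lets MySQL seek to the matching URL first, then range-scan the dates.
SELECT DATE(server_timestamp) AS day, COUNT(*) AS views
FROM pageviews
WHERE url = '/pricing/'
  AND server_timestamp >= NOW() - INTERVAL 30 DAY
GROUP BY day
ORDER BY day;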

4. The events Table

Custom events come from collector plugins — click tracking, scroll depth measurement, form submission monitoring, and any other interaction you want to capture. Each event is a discrete action that happened on a page, linked to a session.

Column           | Type                               | Purpose
id               | INT UNSIGNED AUTO_INCREMENT        | Primary key
session_id       | VARCHAR(36)                        | Links to the session this event belongs to
event_name       | VARCHAR(128)                       | Machine-readable event identifier (e.g., click, scroll_depth, form_submit)
event_category   | VARCHAR(128)                       | Grouping label for dashboards (e.g., interaction, engagement, conversion)
event_data       | JSON                               | Flexible payload — click coordinates, scroll percentage, form field count, etc.
url              | VARCHAR(2048)                      | The page where the event occurred
server_timestamp | DATETIME                           | Authoritative time of event receipt
created_at       | DATETIME DEFAULT CURRENT_TIMESTAMP | Row insertion time

CREATE TABLE events (
    id               INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    session_id       VARCHAR(36) NOT NULL,
    event_name       VARCHAR(128) NOT NULL,
    event_category   VARCHAR(128),
    event_data       JSON,
    url              VARCHAR(2048),
    server_timestamp DATETIME NOT NULL,
    created_at       DATETIME DEFAULT CURRENT_TIMESTAMP,

    INDEX idx_session    (session_id),
    INDEX idx_event_name (event_name),
    INDEX idx_server_ts  (server_timestamp)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

The event_data JSON column is the key design decision here. Rather than creating separate columns for every possible event payload (click x/y coordinates, scroll depth percentage, form field name), a JSON column keeps the schema stable while allowing each event type to carry different data. Dashboard queries can use MySQL's JSON_EXTRACT() function to pull specific values when needed.
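
For example, a scroll-depth report might pull the percentage straight out of event_data. The '$.scroll_pct' path is an assumption about what the scroll plugin stores; adjust it to match your collector's payload.

-- Average scroll depth per page for scroll_depth events over the last 30 days.
SELECT url,
       AVG(CAST(event_data->>'$.scroll_pct' AS DECIMAL(5,2))) AS avg_scroll_pct,
       COUNT(*) AS samples
FROM events
WHERE event_name = 'scroll_depth'
  AND server_timestamp >= NOW() - INTERVAL 30 DAY
GROUP BY url
ORDER BY samples DESC;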

5. The errors Table

JavaScript errors captured by the collector's error tracking module. Each row represents a single error occurrence on a visitor's browser. This table powers the "Errors" section of the dashboard, showing error frequency, affected pages, and browser distributions.

Column           | Type                               | Purpose
id               | INT UNSIGNED AUTO_INCREMENT        | Primary key
session_id       | VARCHAR(36)                        | Links error to a session for per-visit error analysis
error_message    | VARCHAR(1024)                      | The error message (e.g., TypeError: Cannot read properties of null)
error_source     | VARCHAR(2048)                      | Script URL where the error originated
error_line       | INT UNSIGNED                       | Line number in the source file
error_column     | INT UNSIGNED                       | Column number in the source file
stack_trace      | TEXT                               | Full stack trace — can be very long, so TEXT instead of VARCHAR
url              | VARCHAR(2048)                      | The page the user was on when the error occurred
user_agent       | VARCHAR(512)                       | Browser identification for "which browsers produce this error?" analysis
server_timestamp | DATETIME                           | Authoritative time of error receipt
created_at       | DATETIME DEFAULT CURRENT_TIMESTAMP | Row insertion time

CREATE TABLE errors (
    id               INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    session_id       VARCHAR(36) NOT NULL,
    error_message    VARCHAR(1024),
    error_source     VARCHAR(2048),
    error_line       INT UNSIGNED,
    error_column     INT UNSIGNED,
    stack_trace      TEXT,
    url              VARCHAR(2048),
    user_agent       VARCHAR(512),
    server_timestamp DATETIME NOT NULL,
    created_at       DATETIME DEFAULT CURRENT_TIMESTAMP,

    INDEX idx_session   (session_id),
    INDEX idx_server_ts (server_timestamp),
    INDEX idx_message   (error_message(255))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Why store user_agent in the errors table? Many JavaScript errors are browser-specific. A TypeError that only appears in Safari 16 or a ReferenceError caused by a missing API in older Firefox builds becomes immediately visible when you can group errors by User-Agent. Duplicating this column avoids a JOIN to the pageviews table for the most common error-analysis queries.
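
A typical error-ranking query then groups directly on the duplicated column, no JOIN required. A sketch of the sort of query this design supports:

-- Top error messages over the last 7 days, broken down by browser string.
SELECT error_message, user_agent, COUNT(*) AS occurrences
FROM errors
WHERE server_timestamp >= NOW() - INTERVAL 7 DAY
GROUP BY error_message, user_agent
ORDER BY occurrences DESC
LIMIT 20;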

6. The performance Table

Navigation Timing metrics and Core Web Vitals collected by the performance timing and web vitals modules. Each row corresponds to a single page load and captures the full timing breakdown.

Column             | Type                               | Purpose
id                 | INT UNSIGNED AUTO_INCREMENT        | Primary key
session_id         | VARCHAR(36)                        | Links to session for per-visit performance analysis
url                | VARCHAR(2048)                      | The page whose performance was measured
ttfb               | FLOAT                              | Time to First Byte in ms — server responsiveness
dom_content_loaded | FLOAT                              | DOMContentLoaded event time in ms — HTML parsed, DOM ready
dom_complete       | FLOAT                              | DOM complete time in ms — all subresources finished loading
load_time          | FLOAT                              | Full page load time in ms (load event fired)
lcp                | FLOAT                              | Largest Contentful Paint in ms — Core Web Vital
cls                | FLOAT                              | Cumulative Layout Shift (unitless) — Core Web Vital
inp                | FLOAT                              | Interaction to Next Paint in ms — Core Web Vital
fcp                | FLOAT                              | First Contentful Paint in ms — first visible content rendered
transfer_size      | INT UNSIGNED                       | Total bytes transferred for the page and its resources
resource_count     | SMALLINT UNSIGNED                  | Number of resources loaded (scripts, stylesheets, images, etc.)
server_timestamp   | DATETIME                           | Authoritative time of beacon receipt
created_at         | DATETIME DEFAULT CURRENT_TIMESTAMP | Row insertion time

CREATE TABLE performance (
    id                  INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    session_id          VARCHAR(36) NOT NULL,
    url                 VARCHAR(2048),
    ttfb                FLOAT,
    dom_content_loaded  FLOAT,
    dom_complete        FLOAT,
    load_time           FLOAT,
    lcp                 FLOAT,
    cls                 FLOAT,
    inp                 FLOAT,
    fcp                 FLOAT,
    transfer_size       INT UNSIGNED,
    resource_count      SMALLINT UNSIGNED,
    server_timestamp    DATETIME NOT NULL,
    created_at          DATETIME DEFAULT CURRENT_TIMESTAMP,

    INDEX idx_session   (session_id),
    INDEX idx_server_ts (server_timestamp),
    INDEX idx_url_ts    (url(255), server_timestamp)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

All timing columns use FLOAT because fractional values matter: CLS is a small unitless score (values like 0.042 are common) and the timing metrics benefit from sub-millisecond resolution. FLOAT provides adequate precision for these values while using only 4 bytes per column, half the size of DOUBLE.
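
A per-page Web Vitals summary built on this table might look like the following sketch. It uses plain averages for brevity; production Web Vitals reporting usually looks at the 75th percentile, which takes more work in MySQL.

-- Average Core Web Vitals per page over the last 30 days.
SELECT url,
       ROUND(AVG(lcp), 0) AS avg_lcp_ms,
       ROUND(AVG(cls), 3) AS avg_cls,
       ROUND(AVG(inp), 0) AS avg_inp_ms,
       COUNT(*)           AS samples
FROM performance
WHERE server_timestamp >= NOW() - INTERVAL 30 DAY
GROUP BY url
ORDER BY avg_lcp_ms DESC;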

7. The sessions Table

The sessions table stores aggregated visit data computed by the sessionization module. Rather than recalculating session-level metrics from raw pageviews every time the dashboard loads, this table pre-computes them. It is updated incrementally as new beacons arrive for an active session.

Column           | Type                    | Purpose
session_id       | VARCHAR(36) PRIMARY KEY | UUID generated at session start — the primary key, not auto-increment
first_page       | VARCHAR(2048)           | Landing page URL — where the visitor entered the site
last_page        | VARCHAR(2048)           | Exit page URL — the last page visited (updated with each beacon)
page_count       | SMALLINT UNSIGNED       | Number of pages viewed in this session (incremented per pageview beacon)
start_time       | DATETIME                | Timestamp of the first beacon in this session
last_activity    | DATETIME                | Timestamp of the most recent beacon (used for 30-minute timeout logic)
duration_seconds | INT UNSIGNED            | Elapsed time from first to last beacon — computed as TIMESTAMPDIFF(SECOND, start_time, last_activity)
referrer         | VARCHAR(2048)           | Referrer from the first beacon — traffic source attribution
user_agent       | VARCHAR(512)            | User-Agent from the first beacon — device/browser identification

CREATE TABLE sessions (
    session_id       VARCHAR(36) PRIMARY KEY,
    first_page       VARCHAR(2048),
    last_page        VARCHAR(2048),
    page_count       SMALLINT UNSIGNED DEFAULT 1,
    start_time       DATETIME NOT NULL,
    last_activity    DATETIME NOT NULL,
    duration_seconds INT UNSIGNED DEFAULT 0,
    referrer         VARCHAR(2048),
    user_agent       VARCHAR(512),

    INDEX idx_start_time    (start_time),
    INDEX idx_last_activity (last_activity)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Notice that session_id is the primary key, not an auto-increment integer. Session IDs are generated by the server-processing pipeline (as UUIDs), and the other tables reference them, sometimes before the session row itself has been written. Using the UUID directly as the PK avoids an extra lookup to translate it into an auto-increment ID.

INSERT ... ON DUPLICATE KEY UPDATE: The sessionization logic uses this MySQL pattern to create a new session on the first beacon and update the existing session on subsequent beacons. If the session_id already exists, update last_page, last_activity, page_count, and duration_seconds. If it does not exist, insert a new row.
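
A sketch of that upsert, with placeholder values standing in for the fields parsed from the beacon (it assumes server timestamps are kept in UTC). VALUES(col) refers to the value this INSERT attempted to write; newer MySQL releases also offer a row-alias syntax for the same thing.

-- First beacon for a session inserts the row; later beacons update it in place.
INSERT INTO sessions
    (session_id, first_page, last_page, page_count, start_time, last_activity,
     duration_seconds, referrer, user_agent)
VALUES
    ('0f8fad5b-d9cb-469f-a165-70867728950e', '/pricing/', '/pricing/', 1,
     UTC_TIMESTAMP(), UTC_TIMESTAMP(), 0, 'https://www.example.com/', 'Mozilla/5.0 (placeholder)')
ON DUPLICATE KEY UPDATE
    last_page        = VALUES(last_page),
    last_activity    = VALUES(last_activity),
    page_count       = page_count + 1,
    duration_seconds = TIMESTAMPDIFF(SECOND, start_time, VALUES(last_activity));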

8. The users Table

This table has nothing to do with analytics visitors. It stores dashboard login accounts — the people who log into your analytics dashboard to view reports. It is a standard authentication table with role-based access control.

Column        | Type                               | Purpose
id            | INT UNSIGNED AUTO_INCREMENT        | Primary key
email         | VARCHAR(255) UNIQUE                | Login identifier — must be unique across all accounts
password_hash | VARCHAR(255)                       | Bcrypt or Argon2 hash — never store plaintext passwords
display_name  | VARCHAR(128)                       | Human-readable name shown in the dashboard UI
role          | ENUM('owner','admin','viewer')     | Access level: owner has full control, admin can manage settings, viewer is read-only
created_at    | DATETIME DEFAULT CURRENT_TIMESTAMP | Account creation time
last_login    | DATETIME                           | Last successful login — useful for auditing and identifying inactive accounts

CREATE TABLE users (
    id             INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    email          VARCHAR(255) NOT NULL UNIQUE,
    password_hash  VARCHAR(255) NOT NULL,
    display_name   VARCHAR(128),
    role           ENUM('owner', 'admin', 'viewer') NOT NULL DEFAULT 'viewer',
    created_at     DATETIME DEFAULT CURRENT_TIMESTAMP,
    last_login     DATETIME
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Password storage: The password_hash column stores the output of password_hash() in PHP or bcrypt.hash() in Node.js. Never store plaintext passwords. Never use MD5 or SHA-256 for password hashing — these are fast hashes designed for integrity checks, not password storage. Use bcrypt, scrypt, or Argon2, which are intentionally slow to resist brute-force attacks.

9. Indexes for Analytics

Without indexes, every query performs a full table scan. For a pageviews table with millions of rows, that means reading every row from disk to answer a simple question like "how many page views happened yesterday." Indexes are the single most important performance optimization for analytics databases.

How Indexes Work (Briefly)

An index is a sorted copy of one or more columns stored in a B-tree structure. When MySQL receives a query with a WHERE clause on an indexed column, it traverses the B-tree to find matching rows in O(log n) time instead of scanning all n rows. For a table with 10 million rows, that is the difference between roughly 23 comparisons (log2 of 10 million) and reading 10 million rows.

Covering Indexes

A covering index contains all the columns a query needs, so MySQL can answer the query entirely from the index without touching the table data. This is the fastest possible query execution.

-- This query needs: server_timestamp (WHERE), url (SELECT), COUNT (aggregate).
-- An index on (server_timestamp, url) serves both the filter and the grouping:
CREATE INDEX idx_ts_url ON pageviews (server_timestamp, url(255));
-- Caveat: because url is indexed by a 255-character prefix, MySQL still reads
-- the row to fetch the full URL, so this only acts as a true covering index
-- when the whole column fits in the key (for example, a shorter VARCHAR url).

SELECT url, COUNT(*) as views
FROM pageviews
WHERE server_timestamp BETWEEN '2024-01-01' AND '2024-01-07'
GROUP BY url
ORDER BY views DESC;

Key Indexes per Table

Table       | Index                        | Supports
pageviews   | (server_timestamp)           | Date-range queries, "page views per day" charts
pageviews   | (session_id)                 | Session-level aggregation, "pages per session" queries
pageviews   | (url(255), server_timestamp) | Per-page traffic over time, "top pages this week"
events      | (event_name)                 | "Click count by event type" queries
events      | (server_timestamp)           | Event volume over time
errors      | (server_timestamp)           | Error rate over time
errors      | (error_message(255))         | Grouping errors by message for "top errors" report
performance | (server_timestamp)           | Performance trends over time
performance | (url(255), server_timestamp) | Per-page performance breakdown
sessions    | (start_time)                 | Session count per day, "unique visitors" approximation

Verifying Index Usage with EXPLAIN

Always verify that MySQL actually uses your indexes. The EXPLAIN statement shows the query execution plan:

EXPLAIN SELECT url, COUNT(*) as views
FROM pageviews
WHERE server_timestamp BETWEEN '2024-01-01' AND '2024-01-07'
GROUP BY url;

-- Look for:
--   type: range (using index for range scan, good)
--   key: idx_server_ts (your index is being used)
--   rows: 8432 (estimated rows examined, should be << total rows)
--
-- Red flags:
--   type: ALL (full table scan, no index used)
--   key: NULL (no index selected)
--   rows: 5000000 (examining most of the table)
Index tradeoff: Every index speeds up reads but slows down writes. Each INSERT must update every index on the table. For analytics workloads where data is written once and read many times, the tradeoff is overwhelmingly worth it. Do not be afraid to add indexes on analytics tables.

10. Normalization vs Denormalization

In a traditional application database, you would normalize aggressively: extract user_agent strings into a separate user_agents table with a foreign key, store URLs in a urls table, and reference them by ID. This eliminates data duplication and keeps the database compact.

Analytics schemas take a different approach. They are often intentionally denormalized for read performance.

Why Denormalize?

Consideration      | Normalized                                                   | Denormalized
Query complexity   | JOINs required for every dashboard query                     | Single-table queries, no JOINs
Query speed        | Slower — JOINs add latency, especially with millions of rows | Faster — data is already in the row
Disk usage         | Smaller — shared strings stored once                         | Larger — strings duplicated per row
Write speed        | Slower — may need INSERT into multiple tables per beacon     | Faster — single INSERT per beacon
Schema maintenance | More tables, more foreign keys, more complexity              | Fewer tables, simpler schema

When to Normalize

Normalization still makes sense in specific cases, for example when a duplicated string (such as the raw user-agent) measurably dominates disk usage, or when a shared dimension must be updated in one place and reflected everywhere it is referenced.

For this project, keep it denormalized. At the data volumes a class project generates (thousands to tens of thousands of rows), the disk savings of normalization are negligible, while the query simplicity of denormalization makes development and debugging much easier. Normalize only when you can measure a real performance or storage problem.

11. Data Types

Choosing the right MySQL data type for each column affects storage efficiency, query performance, and data integrity. Here are the key decisions for an analytics schema:

VARCHAR vs TEXT

Type       | Max Size     | Indexable                | Use When
VARCHAR(n) | 65,535 bytes | Yes (with prefix length) | Bounded-length strings: URLs, user agents, error messages
TEXT       | 65,535 bytes | Only with prefix index   | Unbounded strings: stack traces, raw payloads

Use VARCHAR when you know the maximum length and want to enforce it at the database level. Use TEXT for columns like stack_trace where the content length is unpredictable and you do not need to index the full value.

INT vs BIGINT

Type              | Size    | Max Value                    | Use When
INT UNSIGNED      | 4 bytes | 4,294,967,295 (~4.3 billion) | Auto-increment PKs, counts, most analytics IDs
BIGINT            | 8 bytes | ~9.2 quintillion (signed)    | Unix timestamps in milliseconds, very high-volume counters
SMALLINT UNSIGNED | 2 bytes | 65,535                       | Viewport dimensions, resource counts, page counts

The client_timestamp column uses BIGINT because JavaScript's Date.now() returns milliseconds since epoch (values like 1706745600000), which exceeds INT's 4.3 billion maximum.
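
If you ever want to compare the two clocks, the millisecond epoch converts to a datetime in SQL. The sketch below assumes the session time zone is UTC, so that FROM_UNIXTIME() and server_timestamp are directly comparable.

-- Estimate client clock drift by converting the millisecond epoch to a datetime.
SELECT id,
       FROM_UNIXTIME(client_timestamp / 1000) AS client_time,
       server_timestamp,
       TIMESTAMPDIFF(SECOND,
                     FROM_UNIXTIME(client_timestamp / 1000),
                     server_timestamp) AS drift_seconds
FROM pageviews
WHERE client_timestamp IS NOT NULL
LIMIT 100;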

DATETIME vs TIMESTAMP

Type      | Range                    | Storage (MySQL 5.7+)           | Timezone Behavior
DATETIME  | 1000-01-01 to 9999-12-31 | 5 bytes (+ fractional seconds) | Stored as-is, no conversion
TIMESTAMP | 1970-01-01 to 2038-01-19 | 4 bytes (+ fractional seconds) | Stored as UTC, converted to session timezone on read

Use DATETIME for analytics timestamps. The TIMESTAMP type's 2038 limit is a real concern for data you plan to retain long-term, and the automatic timezone conversion can cause confusion when querying across timezones. Store everything in UTC and handle timezone conversion in the dashboard application layer.
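
In practice that means the ingest code sets server_timestamp with UTC_TIMESTAMP() rather than NOW(), which follows the session time zone. A minimal sketch with placeholder values:

-- Store the authoritative timestamp in UTC at beacon receipt.
INSERT INTO pageviews (url, type, session_id, server_timestamp)
VALUES ('/docs/', 'pageview', '0f8fad5b-d9cb-469f-a165-70867728950e', UTC_TIMESTAMP());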

JSON Column Type

MySQL 5.7+ supports a native JSON column type. It validates that inserted data is valid JSON, supports indexing via generated columns, and provides functions like JSON_EXTRACT(), JSON_UNQUOTE(), and ->> (shorthand for extract + unquote).

-- Extract a value from the JSON payload column:
SELECT payload->>'$.browser' AS browser
FROM pageviews
WHERE server_timestamp > '2024-01-01';

-- Create a generated column + index for frequent JSON queries:
ALTER TABLE events
ADD COLUMN scroll_depth INT
    GENERATED ALWAYS AS (JSON_EXTRACT(event_data, '$.scroll_pct')) VIRTUAL,
ADD INDEX idx_scroll_depth (scroll_depth);

Character Set: utf8mb4

Always use utf8mb4, not utf8. MySQL's utf8 is a legacy alias that only supports 3-byte characters, which cannot store emoji or some CJK characters. utf8mb4 is true UTF-8 with full 4-byte support. User-Agent strings, URLs, and page titles can all contain characters that require 4 bytes.
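
If a table was created with the legacy utf8 charset by mistake, it can be converted in place. The conversion rewrites the whole table, so test it on a copy first; a sketch:

-- Convert an existing table to true 4-byte UTF-8.
ALTER TABLE pageviews
    CONVERT TO CHARACTER SET utf8mb4
    COLLATE utf8mb4_unicode_ci;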

ENUM for Bounded Values

Use ENUM for columns with a small, fixed set of possible values. MySQL stores ENUM values as 1-2 byte integers internally, making them more storage-efficient and type-safe than VARCHAR:

-- ENUM for user roles (3 possible values, stored as 1 byte each):
role ENUM('owner', 'admin', 'viewer') NOT NULL DEFAULT 'viewer'

-- ENUM for beacon types (if your collector uses a fixed set):
type ENUM('pageview', 'spa_navigation', 'event', 'error', 'performance')

12. Partitioning by Date

As your analytics tables grow beyond millions of rows, even indexed queries can slow down because MySQL must scan a large index tree. RANGE partitioning on the server_timestamp column solves this by physically splitting the table into separate partitions, one per month (or week, or day).

How Partitioning Helps

When MySQL receives a query with a WHERE server_timestamp BETWEEN ... clause, it performs partition pruning — it skips partitions that cannot contain matching rows. A query for "last 7 days" on a table with 12 monthly partitions scans one partition (or two, if the range straddles a month boundary) instead of all 12. Pruning is automatic and requires no changes to your queries.

CREATE TABLE pageviews (
    id              INT UNSIGNED AUTO_INCREMENT,
    url             VARCHAR(2048) NOT NULL,
    type            VARCHAR(32) NOT NULL DEFAULT 'pageview',
    user_agent      VARCHAR(512),
    viewport_width  SMALLINT UNSIGNED,
    viewport_height SMALLINT UNSIGNED,
    referrer        VARCHAR(2048),
    client_timestamp BIGINT,
    server_timestamp DATETIME NOT NULL,
    client_ip       VARCHAR(45),
    session_id      VARCHAR(36) NOT NULL,
    payload         JSON,
    created_at      DATETIME DEFAULT CURRENT_TIMESTAMP,

    PRIMARY KEY (id, server_timestamp),
    INDEX idx_session (session_id),
    INDEX idx_url_ts  (url(255), server_timestamp)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
PARTITION BY RANGE (TO_DAYS(server_timestamp)) (
    PARTITION p2024_01 VALUES LESS THAN (TO_DAYS('2024-02-01')),
    PARTITION p2024_02 VALUES LESS THAN (TO_DAYS('2024-03-01')),
    PARTITION p2024_03 VALUES LESS THAN (TO_DAYS('2024-04-01')),
    PARTITION p2024_04 VALUES LESS THAN (TO_DAYS('2024-05-01')),
    PARTITION p2024_05 VALUES LESS THAN (TO_DAYS('2024-06-01')),
    PARTITION p2024_06 VALUES LESS THAN (TO_DAYS('2024-07-01')),
    PARTITION p_future  VALUES LESS THAN MAXVALUE
);
Note the composite primary key: Partitioned tables in MySQL require the partitioning column to be part of every unique index, including the primary key. That is why the PK is (id, server_timestamp) instead of just (id).
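
You can confirm that pruning happens with EXPLAIN, which in MySQL 5.7 and later includes a partitions column listing the partitions the query will touch. A sketch using the partition names from the example above:

EXPLAIN SELECT COUNT(*)
FROM pageviews
WHERE server_timestamp BETWEEN '2024-03-01' AND '2024-03-07';

-- The partitions column of the output should show only p2024_03,
-- not every partition in the table.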

Data Retention with DROP PARTITION

The biggest operational advantage of partitioning is data retention. When you need to delete old data, dropping a partition is nearly instantaneous, while a DELETE statement on millions of rows is painfully slow and generates enormous redo logs:

Approach       | Operation                                                    | Speed            | Impact
DELETE         | DELETE FROM pageviews WHERE server_timestamp < '2024-01-01' | Minutes to hours | Locks rows, generates redo log, fragments table, requires OPTIMIZE TABLE after
DROP PARTITION | ALTER TABLE pageviews DROP PARTITION p2023_12                | Milliseconds     | Drops the partition file instantly, no row-level locking, no fragmentation

-- Delete 3 months of data in milliseconds:
ALTER TABLE pageviews DROP PARTITION p2024_01, p2024_02, p2024_03;

-- Add a new partition for next month:
ALTER TABLE pageviews REORGANIZE PARTITION p_future INTO (
    PARTITION p2024_07 VALUES LESS THAN (TO_DAYS('2024-08-01')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);
Plan your retention policy early. If you decide to partition an existing unpartitioned table, the migration requires copying all data to a new table structure. It is far easier to create the table with partitions from the start, even if you do not need retention management yet.

Tutorial Modules

01. Schema Design

Create the MySQL database and all six tables with proper data types, indexes, and character encoding. Write and execute the full schema DDL.

02. Analytics Queries

Write the SQL queries that power the dashboard: page views over time, top pages, session counts, performance percentiles, and error rankings.

03. Data Retention

Implement date-range partitioning, write a cron job to rotate partitions monthly, and configure a retention policy to drop data older than N months.
