Storage & Schema Design

Analytics data is only useful if it is stored in a structure that makes querying fast and reliable. This section covers MySQL schema design for analytics — tables, indexes, data types, and retention strategies.

1. Why Schema Matters

A bad schema means slow queries, wasted disk, and frustrated dashboard users. Analytics queries are read-heavy and aggregate-heavy. The schema must optimize for GROUP BY, date ranges, and COUNT/AVG operations.

Consider the difference between a well-designed analytics schema and a naive one. A single pageviews table with no indexes on server_timestamp will require a full table scan every time someone loads the dashboard for "page views in the last 7 days." With a million rows, that query takes seconds. With ten million rows, it takes minutes. With proper indexes and partitioning, it takes milliseconds regardless of table size.

The schema decisions you make at the start of a project are among the hardest to change later. Adding an index is easy. Splitting a table, changing a primary key, or migrating from VARCHAR to ENUM requires downtime, data migration scripts, and careful testing. Get it right the first time.

Analytics schemas are different from application schemas. In a typical web application, the schema is normalized, optimized for writes, and designed around entities (users, orders, products). An analytics schema is optimized for reads, designed around events, and often intentionally denormalized to avoid expensive JOINs at query time.

2. Core Tables Overview

The analytics database uses six tables. Each serves a distinct purpose in the pipeline:

Table       | Purpose
pageviews   | Primary fact table — one row per page load, capturing URL, viewport, referrer, and timestamps
events      | Custom interaction events from collector plugins (clicks, scrolls, form submissions)
errors      | JavaScript errors captured by the collector's error tracking module
performance | Navigation Timing metrics and Core Web Vitals (LCP, CLS, INP) per page load
sessions    | Aggregated visit data — entry/exit pages, page count, duration
users       | Dashboard login accounts (NOT analytics visitors) — authentication and role management

[Schema relationship diagram] The six tables and their shared key: session_id links pageviews, events, errors, and performance to a single session record. The users table is independent — it stores dashboard accounts, not analytics visitors, and has no foreign key to the analytics tables.

3. The pageviews Table

This is the primary fact table — the central table that most dashboard queries will hit. Every page load produces exactly one row here. It captures the data the collector sends in its initial beacon plus the enrichment fields added by the server.

Column           | Type                               | Why It Exists
id               | INT UNSIGNED AUTO_INCREMENT        | Surrogate primary key for fast lookups and JOINs
url              | VARCHAR(2048)                      | The page URL visited — the most queried column for "top pages" reports
type             | VARCHAR(32)                        | Beacon type identifier (pageview, spa_navigation) for filtering event categories
user_agent       | VARCHAR(512)                       | Raw User-Agent string from the server (not client-reported) for browser/OS breakdowns
viewport_width   | SMALLINT UNSIGNED                  | Browser viewport width in pixels — used for device-type analysis (mobile vs desktop)
viewport_height  | SMALLINT UNSIGNED                  | Browser viewport height in pixels — paired with width for screen-size distributions
referrer         | VARCHAR(2048)                      | The referring URL — where the visitor came from. Empty string for direct visits
client_timestamp | BIGINT                             | Unix epoch in milliseconds from the client — may be inaccurate but useful for clock-drift analysis
server_timestamp | DATETIME                           | Authoritative timestamp set by the server at beacon receipt — the column you query for date ranges
client_ip        | VARCHAR(45)                        | IPv4 or IPv6 address for geolocation lookups (45 chars covers the longest IPv6 representation)
session_id       | VARCHAR(36)                        | UUID linking this pageview to a session record — the JOIN key for session-level queries
payload          | JSON                               | Catch-all for extra data the collector sends that does not map to a dedicated column — avoids schema changes for new fields
created_at       | DATETIME DEFAULT CURRENT_TIMESTAMP | Row insertion time for auditing — usually identical to server_timestamp but decoupled for clarity

CREATE TABLE pageviews (
    id              INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    url             VARCHAR(2048) NOT NULL,
    type            VARCHAR(32) NOT NULL DEFAULT 'pageview',
    user_agent      VARCHAR(512),
    viewport_width  SMALLINT UNSIGNED,
    viewport_height SMALLINT UNSIGNED,
    referrer        VARCHAR(2048),
    client_timestamp BIGINT,
    server_timestamp DATETIME NOT NULL,
    client_ip       VARCHAR(45),
    session_id      VARCHAR(36) NOT NULL,
    payload         JSON,
    created_at      DATETIME DEFAULT CURRENT_TIMESTAMP,

    INDEX idx_server_ts (server_timestamp),
    INDEX idx_session   (session_id),
    INDEX idx_url_ts    (url(255), server_timestamp)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Why url(255) in the index? MySQL InnoDB limits index key prefixes to 3072 bytes. A VARCHAR(2048) column in utf8mb4 would require 8192 bytes, far exceeding the limit. Indexing the first 255 characters covers the vast majority of real-world URLs while staying within bounds.
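
The kind of query this composite index is designed for looks something like the sketch below; the '/pricing/' URL is just an illustrative value.

-- Daily views for one page over the last 30 days. The (url, server_timestamp)
-- composite index lets MySQL seek to the matching URL first, then range-scan the dates.
SELECT DATE(server_timestamp) AS day, COUNT(*) AS views
FROM pageviews
WHERE url = '/pricing/'
  AND server_timestamp >= NOW() - INTERVAL 30 DAY
GROUP BY day
ORDER BY day;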

4. The events Table

Custom events come from collector plugins — click tracking, scroll depth measurement, form submission monitoring, and any other interaction you want to capture. Each event is a discrete action that happened on a page, linked to a session.

Column           | Type                               | Purpose
id               | INT UNSIGNED AUTO_INCREMENT        | Primary key
session_id       | VARCHAR(36)                        | Links to the session this event belongs to
event_name       | VARCHAR(128)                       | Machine-readable event identifier (e.g., click, scroll_depth, form_submit)
event_category   | VARCHAR(128)                       | Grouping label for dashboards (e.g., interaction, engagement, conversion)
event_data       | JSON                               | Flexible payload — click coordinates, scroll percentage, form field count, etc.
url              | VARCHAR(2048)                      | The page where the event occurred
server_timestamp | DATETIME                           | Authoritative time of event receipt
created_at       | DATETIME DEFAULT CURRENT_TIMESTAMP | Row insertion time

CREATE TABLE events (
    id               INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    session_id       VARCHAR(36) NOT NULL,
    event_name       VARCHAR(128) NOT NULL,
    event_category   VARCHAR(128),
    event_data       JSON,
    url              VARCHAR(2048),
    server_timestamp DATETIME NOT NULL,
    created_at       DATETIME DEFAULT CURRENT_TIMESTAMP,

    INDEX idx_session    (session_id),
    INDEX idx_event_name (event_name),
    INDEX idx_server_ts  (server_timestamp)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

The event_data JSON column is the key design decision here. Rather than creating separate columns for every possible event payload (click x/y coordinates, scroll depth percentage, form field name), a JSON column keeps the schema stable while allowing each event type to carry different data. Dashboard queries can use MySQL's JSON_EXTRACT() function to pull specific values when needed.
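
For example, a scroll-depth report might pull the percentage straight out of event_data. The '$.scroll_pct' path is an assumption about what the scroll plugin stores; adjust it to match your collector's payload.

-- Average scroll depth per page for scroll_depth events over the last 30 days.
SELECT url,
       AVG(CAST(event_data->>'$.scroll_pct' AS DECIMAL(5,2))) AS avg_scroll_pct,
       COUNT(*) AS samples
FROM events
WHERE event_name = 'scroll_depth'
  AND server_timestamp >= NOW() - INTERVAL 30 DAY
GROUP BY url
ORDER BY samples DESC;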

5. The errors Table

JavaScript errors captured by the collector's error tracking module. Each row represents a single error occurrence on a visitor's browser. This table powers the "Errors" section of the dashboard, showing error frequency, affected pages, and browser distributions.

Column           | Type                               | Purpose
id               | INT UNSIGNED AUTO_INCREMENT        | Primary key
session_id       | VARCHAR(36)                        | Links error to a session for per-visit error analysis
error_message    | VARCHAR(1024)                      | The error message (e.g., TypeError: Cannot read properties of null)
error_source     | VARCHAR(2048)                      | Script URL where the error originated
error_line       | INT UNSIGNED                       | Line number in the source file
error_column     | INT UNSIGNED                       | Column number in the source file
stack_trace      | TEXT                               | Full stack trace — can be very long, so TEXT instead of VARCHAR
url              | VARCHAR(2048)                      | The page the user was on when the error occurred
user_agent       | VARCHAR(512)                       | Browser identification for "which browsers produce this error?" analysis
server_timestamp | DATETIME                           | Authoritative time of error receipt
created_at       | DATETIME DEFAULT CURRENT_TIMESTAMP | Row insertion time

CREATE TABLE errors (
    id               INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    session_id       VARCHAR(36) NOT NULL,
    error_message    VARCHAR(1024),
    error_source     VARCHAR(2048),
    error_line       INT UNSIGNED,
    error_column     INT UNSIGNED,
    stack_trace      TEXT,
    url              VARCHAR(2048),
    user_agent       VARCHAR(512),
    server_timestamp DATETIME NOT NULL,
    created_at       DATETIME DEFAULT CURRENT_TIMESTAMP,

    INDEX idx_session   (session_id),
    INDEX idx_server_ts (server_timestamp),
    INDEX idx_message   (error_message(255))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Why store user_agent in the errors table? Many JavaScript errors are browser-specific. A TypeError that only appears in Safari 16 or a ReferenceError caused by a missing API in older Firefox builds becomes immediately visible when you can group errors by User-Agent. Duplicating this column avoids a JOIN to the pageviews table for the most common error-analysis queries.
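
A typical error-ranking query then groups directly on the duplicated column, no JOIN required. A sketch of the sort of query this design supports:

-- Top error messages over the last 7 days, broken down by browser string.
SELECT error_message, user_agent, COUNT(*) AS occurrences
FROM errors
WHERE server_timestamp >= NOW() - INTERVAL 7 DAY
GROUP BY error_message, user_agent
ORDER BY occurrences DESC
LIMIT 20;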

6. The performance Table

Navigation Timing metrics and Core Web Vitals collected by the performance timing and web vitals modules. Each row corresponds to a single page load and captures the full timing breakdown.

Column             | Type                               | Purpose
id                 | INT UNSIGNED AUTO_INCREMENT        | Primary key
session_id         | VARCHAR(36)                        | Links to session for per-visit performance analysis
url                | VARCHAR(2048)                      | The page whose performance was measured
ttfb               | FLOAT                              | Time to First Byte in ms — server responsiveness
dom_content_loaded | FLOAT                              | DOMContentLoaded event time in ms — HTML parsed, DOM ready
dom_complete       | FLOAT                              | DOM complete time in ms — all subresources finished loading
load_time          | FLOAT                              | Full page load time in ms (load event fired)
lcp                | FLOAT                              | Largest Contentful Paint in ms — Core Web Vital
cls                | FLOAT                              | Cumulative Layout Shift (unitless) — Core Web Vital
inp                | FLOAT                              | Interaction to Next Paint in ms — Core Web Vital
fcp                | FLOAT                              | First Contentful Paint in ms — first visible content rendered
transfer_size      | INT UNSIGNED                       | Total bytes transferred for the page and its resources
resource_count     | SMALLINT UNSIGNED                  | Number of resources loaded (scripts, stylesheets, images, etc.)
server_timestamp   | DATETIME                           | Authoritative time of beacon receipt
created_at         | DATETIME DEFAULT CURRENT_TIMESTAMP | Row insertion time

CREATE TABLE performance (
    id                  INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    session_id          VARCHAR(36) NOT NULL,
    url                 VARCHAR(2048),
    ttfb                FLOAT,
    dom_content_loaded  FLOAT,
    dom_complete        FLOAT,
    load_time           FLOAT,
    lcp                 FLOAT,
    cls                 FLOAT,
    inp                 FLOAT,
    fcp                 FLOAT,
    transfer_size       INT UNSIGNED,
    resource_count      SMALLINT UNSIGNED,
    server_timestamp    DATETIME NOT NULL,
    created_at          DATETIME DEFAULT CURRENT_TIMESTAMP,

    INDEX idx_session   (session_id),
    INDEX idx_server_ts (server_timestamp),
    INDEX idx_url_ts    (url(255), server_timestamp)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

All timing columns use FLOAT because fractional values matter: CLS is a small unitless score (values like 0.042 are common) and the timing metrics benefit from sub-millisecond resolution. FLOAT provides adequate precision for these values while using only 4 bytes per column, half the size of DOUBLE.
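
A per-page Web Vitals summary built on this table might look like the following sketch. It uses plain averages for brevity; production Web Vitals reporting usually looks at the 75th percentile, which takes more work in MySQL.

-- Average Core Web Vitals per page over the last 30 days.
SELECT url,
       ROUND(AVG(lcp), 0) AS avg_lcp_ms,
       ROUND(AVG(cls), 3) AS avg_cls,
       ROUND(AVG(inp), 0) AS avg_inp_ms,
       COUNT(*)           AS samples
FROM performance
WHERE server_timestamp >= NOW() - INTERVAL 30 DAY
GROUP BY url
ORDER BY avg_lcp_ms DESC;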

7. The sessions Table

The sessions table stores aggregated visit data computed by the sessionization module. Rather than recalculating session-level metrics from raw pageviews every time the dashboard loads, this table pre-computes them. It is updated incrementally as new beacons arrive for an active session.

Column           | Type                    | Purpose
session_id       | VARCHAR(36) PRIMARY KEY | UUID generated at session start — the primary key, not auto-increment
first_page       | VARCHAR(2048)           | Landing page URL — where the visitor entered the site
last_page        | VARCHAR(2048)           | Exit page URL — the last page visited (updated with each beacon)
page_count       | SMALLINT UNSIGNED       | Number of pages viewed in this session (incremented per pageview beacon)
start_time       | DATETIME                | Timestamp of the first beacon in this session
last_activity    | DATETIME                | Timestamp of the most recent beacon (used for 30-minute timeout logic)
duration_seconds | INT UNSIGNED            | Elapsed time from first to last beacon — computed as TIMESTAMPDIFF(SECOND, start_time, last_activity)
referrer         | VARCHAR(2048)           | Referrer from the first beacon — traffic source attribution
user_agent       | VARCHAR(512)            | User-Agent from the first beacon — device/browser identification

CREATE TABLE sessions (
    session_id       VARCHAR(36) PRIMARY KEY,
    first_page       VARCHAR(2048),
    last_page        VARCHAR(2048),
    page_count       SMALLINT UNSIGNED DEFAULT 1,
    start_time       DATETIME NOT NULL,
    last_activity    DATETIME NOT NULL,
    duration_seconds INT UNSIGNED DEFAULT 0,
    referrer         VARCHAR(2048),
    user_agent       VARCHAR(512),

    INDEX idx_start_time    (start_time),
    INDEX idx_last_activity (last_activity)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Notice that session_id is the primary key, not an auto-increment integer. Session IDs are generated by the server-processing pipeline (as UUIDs), and the other tables reference them, sometimes before the session row itself has been written. Using the UUID directly as the PK avoids an extra lookup to translate it into an auto-increment ID.

INSERT ... ON DUPLICATE KEY UPDATE: The sessionization logic uses this MySQL pattern to create a new session on the first beacon and update the existing session on subsequent beacons. If the session_id already exists, update last_page, last_activity, page_count, and duration_seconds. If it does not exist, insert a new row.
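
A sketch of that upsert, with placeholder values standing in for the fields parsed from the beacon (it assumes server timestamps are kept in UTC). VALUES(col) refers to the value this INSERT attempted to write; newer MySQL releases also offer a row-alias syntax for the same thing.

-- First beacon for a session inserts the row; later beacons update it in place.
INSERT INTO sessions
    (session_id, first_page, last_page, page_count, start_time, last_activity,
     duration_seconds, referrer, user_agent)
VALUES
    ('0f8fad5b-d9cb-469f-a165-70867728950e', '/pricing/', '/pricing/', 1,
     UTC_TIMESTAMP(), UTC_TIMESTAMP(), 0, 'https://www.example.com/', 'Mozilla/5.0 (placeholder)')
ON DUPLICATE KEY UPDATE
    last_page        = VALUES(last_page),
    last_activity    = VALUES(last_activity),
    page_count       = page_count + 1,
    duration_seconds = TIMESTAMPDIFF(SECOND, start_time, VALUES(last_activity));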

8. The users Table

This table has nothing to do with analytics visitors. It stores dashboard login accounts — the people who log into your analytics dashboard to view reports. It is a standard authentication table with role-based access control.

Column        | Type                               | Purpose
id            | INT UNSIGNED AUTO_INCREMENT        | Primary key
email         | VARCHAR(255) UNIQUE                | Login identifier — must be unique across all accounts
password_hash | VARCHAR(255)                       | Bcrypt or Argon2 hash — never store plaintext passwords
display_name  | VARCHAR(128)                       | Human-readable name shown in the dashboard UI
role          | ENUM('owner','admin','viewer')     | Access level: owner has full control, admin can manage settings, viewer is read-only
created_at    | DATETIME DEFAULT CURRENT_TIMESTAMP | Account creation time
last_login    | DATETIME                           | Last successful login — useful for auditing and identifying inactive accounts

CREATE TABLE users (
    id             INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    email          VARCHAR(255) NOT NULL UNIQUE,
    password_hash  VARCHAR(255) NOT NULL,
    display_name   VARCHAR(128),
    role           ENUM('owner', 'admin', 'viewer') NOT NULL DEFAULT 'viewer',
    created_at     DATETIME DEFAULT CURRENT_TIMESTAMP,
    last_login     DATETIME
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Password storage: The password_hash column stores the output of password_hash() in PHP or bcrypt.hash() in Node.js. Never store plaintext passwords. Never use MD5 or SHA-256 for password hashing — these are fast hashes designed for integrity checks, not password storage. Use bcrypt, scrypt, or Argon2, which are intentionally slow to resist brute-force attacks.

9. Indexes for Analytics

Without indexes, every query performs a full table scan. For a pageviews table with millions of rows, that means reading every row from disk to answer a simple question like "how many page views happened yesterday." Indexes are the single most important performance optimization for analytics databases.

How Indexes Work (Briefly)

An index is a sorted copy of one or more columns stored in a B-tree structure. When MySQL receives a query with a WHERE clause on an indexed column, it traverses the B-tree to find matching rows in O(log n) time instead of scanning all n rows. For a table with 10 million rows, that is the difference between roughly 23 comparisons (log2 of 10 million) and reading 10 million rows.

Covering Indexes

A covering index contains all the columns a query needs, so MySQL can answer the query entirely from the index without touching the table data. This is the fastest possible query execution.

-- This query needs: server_timestamp (WHERE), url (SELECT), COUNT (aggregate).
-- An index on (server_timestamp, url) serves both the filter and the grouping:
CREATE INDEX idx_ts_url ON pageviews (server_timestamp, url(255));
-- Caveat: because url is indexed by a 255-character prefix, MySQL still reads
-- the row to fetch the full URL, so this only acts as a true covering index
-- when the whole column fits in the key (for example, a shorter VARCHAR url).

SELECT url, COUNT(*) as views
FROM pageviews
WHERE server_timestamp BETWEEN '2024-01-01' AND '2024-01-07'
GROUP BY url
ORDER BY views DESC;

Key Indexes per Table

Table       | Index                        | Supports
pageviews   | (server_timestamp)           | Date-range queries, "page views per day" charts
pageviews   | (session_id)                 | Session-level aggregation, "pages per session" queries
pageviews   | (url(255), server_timestamp) | Per-page traffic over time, "top pages this week"
events      | (event_name)                 | "Click count by event type" queries
events      | (server_timestamp)           | Event volume over time
errors      | (server_timestamp)           | Error rate over time
errors      | (error_message(255))         | Grouping errors by message for "top errors" report
performance | (server_timestamp)           | Performance trends over time
performance | (url(255), server_timestamp) | Per-page performance breakdown
sessions    | (start_time)                 | Session count per day, "unique visitors" approximation

Verifying Index Usage with EXPLAIN

Always verify that MySQL actually uses your indexes. The EXPLAIN statement shows the query execution plan:

EXPLAIN SELECT url, COUNT(*) as views
FROM pageviews
WHERE server_timestamp BETWEEN '2024-01-01' AND '2024-01-07'
GROUP BY url;

-- Look for:
--   type: range (using index for range scan, good)
--   key: idx_server_ts (your index is being used)
--   rows: 8432 (estimated rows examined, should be << total rows)
--
-- Red flags:
--   type: ALL (full table scan, no index used)
--   key: NULL (no index selected)
--   rows: 5000000 (examining most of the table)
Index tradeoff: Every index speeds up reads but slows down writes. Each INSERT must update every index on the table. For analytics workloads where data is written once and read many times, the tradeoff is overwhelmingly worth it. Do not be afraid to add indexes on analytics tables.

10. Normalization vs Denormalization

In a traditional application database, you would normalize aggressively: extract user_agent strings into a separate user_agents table with a foreign key, store URLs in a urls table, and reference them by ID. This eliminates data duplication and keeps the database compact.

Analytics schemas take a different approach. They are often intentionally denormalized for read performance.

Why Denormalize?

Consideration      | Normalized                                                   | Denormalized
Query complexity   | JOINs required for every dashboard query                     | Single-table queries, no JOINs
Query speed        | Slower — JOINs add latency, especially with millions of rows | Faster — data is already in the row
Disk usage         | Smaller — shared strings stored once                         | Larger — strings duplicated per row
Write speed        | Slower — may need INSERT into multiple tables per beacon     | Faster — single INSERT per beacon
Schema maintenance | More tables, more foreign keys, more complexity              | Fewer tables, simpler schema

When to Normalize

Normalization still makes sense in specific cases, for example when a duplicated string (such as the raw user-agent) measurably dominates disk usage, or when a shared dimension must be updated in one place and reflected everywhere it is referenced.

For this project, keep it denormalized. At the data volumes a class project generates (thousands to tens of thousands of rows), the disk savings of normalization are negligible, while the query simplicity of denormalization makes development and debugging much easier. Normalize only when you can measure a real performance or storage problem.

11. Data Types

Choosing the right MySQL data type for each column affects storage efficiency, query performance, and data integrity. Here are the key decisions for an analytics schema:

VARCHAR vs TEXT

Type       | Max Size     | Indexable                | Use When
VARCHAR(n) | 65,535 bytes | Yes (with prefix length) | Bounded-length strings: URLs, user agents, error messages
TEXT       | 65,535 bytes | Only with prefix index   | Unbounded strings: stack traces, raw payloads

Use VARCHAR when you know the maximum length and want to enforce it at the database level. Use TEXT for columns like stack_trace where the content length is unpredictable and you do not need to index the full value.

INT vs BIGINT

Type              | Size    | Max Value                    | Use When
INT UNSIGNED      | 4 bytes | 4,294,967,295 (~4.3 billion) | Auto-increment PKs, counts, most analytics IDs
BIGINT            | 8 bytes | ~9.2 quintillion (signed)    | Unix timestamps in milliseconds, very high-volume counters
SMALLINT UNSIGNED | 2 bytes | 65,535                       | Viewport dimensions, resource counts, page counts

The client_timestamp column uses BIGINT because JavaScript's Date.now() returns milliseconds since epoch (values like 1706745600000), which exceeds INT's 4.3 billion maximum.
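
If you ever want to compare the two clocks, the millisecond epoch converts to a datetime in SQL. The sketch below assumes the session time zone is UTC, so that FROM_UNIXTIME() and server_timestamp are directly comparable.

-- Estimate client clock drift by converting the millisecond epoch to a datetime.
SELECT id,
       FROM_UNIXTIME(client_timestamp / 1000) AS client_time,
       server_timestamp,
       TIMESTAMPDIFF(SECOND,
                     FROM_UNIXTIME(client_timestamp / 1000),
                     server_timestamp) AS drift_seconds
FROM pageviews
WHERE client_timestamp IS NOT NULL
LIMIT 100;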

DATETIME vs TIMESTAMP

Type      | Range                    | Storage (MySQL 5.7+)           | Timezone Behavior
DATETIME  | 1000-01-01 to 9999-12-31 | 5 bytes (+ fractional seconds) | Stored as-is, no conversion
TIMESTAMP | 1970-01-01 to 2038-01-19 | 4 bytes (+ fractional seconds) | Stored as UTC, converted to session timezone on read

Use DATETIME for analytics timestamps. The TIMESTAMP type's 2038 limit is a real concern for data you plan to retain long-term, and the automatic timezone conversion can cause confusion when querying across timezones. Store everything in UTC and handle timezone conversion in the dashboard application layer.
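
In practice that means the ingest code sets server_timestamp with UTC_TIMESTAMP() rather than NOW(), which follows the session time zone. A minimal sketch with placeholder values:

-- Store the authoritative timestamp in UTC at beacon receipt.
INSERT INTO pageviews (url, type, session_id, server_timestamp)
VALUES ('/docs/', 'pageview', '0f8fad5b-d9cb-469f-a165-70867728950e', UTC_TIMESTAMP());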

JSON Column Type

MySQL 5.7+ supports a native JSON column type. It validates that inserted data is valid JSON, supports indexing via generated columns, and provides functions like JSON_EXTRACT(), JSON_UNQUOTE(), and ->> (shorthand for extract + unquote).

-- Extract a value from the JSON payload column:
SELECT payload->>'$.browser' AS browser
FROM pageviews
WHERE server_timestamp > '2024-01-01';

-- Create a generated column + index for frequent JSON queries:
ALTER TABLE events
ADD COLUMN scroll_depth INT
    GENERATED ALWAYS AS (JSON_EXTRACT(event_data, '$.scroll_pct')) VIRTUAL,
ADD INDEX idx_scroll_depth (scroll_depth);

Character Set: utf8mb4

Always use utf8mb4, not utf8. MySQL's utf8 is a legacy alias that only supports 3-byte characters, which cannot store emoji or some CJK characters. utf8mb4 is true UTF-8 with full 4-byte support. User-Agent strings, URLs, and page titles can all contain characters that require 4 bytes.
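
If a table was created with the legacy utf8 charset by mistake, it can be converted in place. The conversion rewrites the whole table, so test it on a copy first; a sketch:

-- Convert an existing table to true 4-byte UTF-8.
ALTER TABLE pageviews
    CONVERT TO CHARACTER SET utf8mb4
    COLLATE utf8mb4_unicode_ci;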

ENUM for Bounded Values

Use ENUM for columns with a small, fixed set of possible values. MySQL stores ENUM values as 1-2 byte integers internally, making them more storage-efficient and type-safe than VARCHAR:

-- ENUM for user roles (3 possible values, stored as 1 byte each):
role ENUM('owner', 'admin', 'viewer') NOT NULL DEFAULT 'viewer'

-- ENUM for beacon types (if your collector uses a fixed set):
type ENUM('pageview', 'spa_navigation', 'event', 'error', 'performance')

12. Partitioning by Date

As your analytics tables grow beyond millions of rows, even indexed queries can slow down because MySQL must scan a large index tree. RANGE partitioning on the server_timestamp column solves this by physically splitting the table into separate partitions, one per month (or week, or day).

How Partitioning Helps

When MySQL receives a query with a WHERE server_timestamp BETWEEN ... clause, it performs partition pruning — it skips partitions that cannot contain matching rows. A query for "last 7 days" on a table with 12 monthly partitions scans one partition (or two, if the range straddles a month boundary) instead of all 12. Pruning is automatic and requires no changes to your queries.

CREATE TABLE pageviews (
    id              INT UNSIGNED AUTO_INCREMENT,
    url             VARCHAR(2048) NOT NULL,
    type            VARCHAR(32) NOT NULL DEFAULT 'pageview',
    user_agent      VARCHAR(512),
    viewport_width  SMALLINT UNSIGNED,
    viewport_height SMALLINT UNSIGNED,
    referrer        VARCHAR(2048),
    client_timestamp BIGINT,
    server_timestamp DATETIME NOT NULL,
    client_ip       VARCHAR(45),
    session_id      VARCHAR(36) NOT NULL,
    payload         JSON,
    created_at      DATETIME DEFAULT CURRENT_TIMESTAMP,

    PRIMARY KEY (id, server_timestamp),
    INDEX idx_session (session_id),
    INDEX idx_url_ts  (url(255), server_timestamp)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
PARTITION BY RANGE (TO_DAYS(server_timestamp)) (
    PARTITION p2024_01 VALUES LESS THAN (TO_DAYS('2024-02-01')),
    PARTITION p2024_02 VALUES LESS THAN (TO_DAYS('2024-03-01')),
    PARTITION p2024_03 VALUES LESS THAN (TO_DAYS('2024-04-01')),
    PARTITION p2024_04 VALUES LESS THAN (TO_DAYS('2024-05-01')),
    PARTITION p2024_05 VALUES LESS THAN (TO_DAYS('2024-06-01')),
    PARTITION p2024_06 VALUES LESS THAN (TO_DAYS('2024-07-01')),
    PARTITION p_future  VALUES LESS THAN MAXVALUE
);
Note the composite primary key: Partitioned tables in MySQL require the partitioning column to be part of every unique index, including the primary key. That is why the PK is (id, server_timestamp) instead of just (id).
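
You can confirm that pruning happens with EXPLAIN, which in MySQL 5.7 and later includes a partitions column listing the partitions the query will touch. A sketch using the partition names from the example above:

EXPLAIN SELECT COUNT(*)
FROM pageviews
WHERE server_timestamp BETWEEN '2024-03-01' AND '2024-03-07';

-- The partitions column of the output should show only p2024_03,
-- not every partition in the table.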

Data Retention with DROP PARTITION

The biggest operational advantage of partitioning is data retention. When you need to delete old data, dropping a partition is nearly instantaneous, while a DELETE statement on millions of rows is painfully slow and generates enormous redo logs:

Approach       | Operation                                                    | Speed            | Impact
DELETE         | DELETE FROM pageviews WHERE server_timestamp < '2024-01-01' | Minutes to hours | Locks rows, generates redo log, fragments table, requires OPTIMIZE TABLE after
DROP PARTITION | ALTER TABLE pageviews DROP PARTITION p2023_12                | Milliseconds     | Drops the partition file instantly, no row-level locking, no fragmentation

-- Delete 3 months of data in milliseconds:
ALTER TABLE pageviews DROP PARTITION p2024_01, p2024_02, p2024_03;

-- Add a new partition for next month:
ALTER TABLE pageviews REORGANIZE PARTITION p_future INTO (
    PARTITION p2024_07 VALUES LESS THAN (TO_DAYS('2024-08-01')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);
Plan your retention policy early. If you decide to partition an existing unpartitioned table, the migration requires copying all data to a new table structure. It is far easier to create the table with partitions from the start, even if you do not need retention management yet.

Tutorial Modules

01. Schema Design

Create the MySQL database and all six tables with proper data types, indexes, and character encoding. Write and execute the full schema DDL.

02. Analytics Queries

Write the SQL queries that power the dashboard: page views over time, top pages, session counts, performance percentiles, and error rankings.

03. Data Retention

Implement date-range partitioning, write a cron job to rotate partitions monthly, and configure a retention policy to drop data older than N months.
