Module 03: Data Retention

Analytics data grows fast. A site with moderate traffic can accumulate millions of rows per month. This module covers strategies for managing data growth: range partitioning for fast date queries and easy purging, summary aggregation tables for long-term trends, and automated cleanup scripts.

Demo Files

retention.sql: partitioned table DDL, the daily_summary table, and the aggregate_daily stored procedure (referenced throughout this module)

1. The Growth Problem

Consider a site with moderate traffic — 10,000 pageviews per day. Over a full year that is:

10,000 pageviews/day x 365 days = 3,650,000 rows/year

At ~500 bytes per row:
3,650,000 x 500 bytes = ~1.7 GB (pageviews alone)

Add events, errors, performance timing data:
easily 5–10 GB/year of raw analytics data

And that is just the pageviews table: custom events, JavaScript error logs, and performance timing entries are what push the total into the 5–10 GB/year range. A busy site with 100K+ daily pageviews reaches these volumes in roughly a month.

The real cost is not disk space — it is query performance. A SELECT that scans 10 million rows to produce a date-range report becomes painfully slow. Indexes help, but they grow too, consuming memory and slowing writes.

Key insight: Analytics data has a natural time dimension. Recent data is queried frequently and needs to be fast. Historical data is queried rarely and can be summarized. This asymmetry is the basis for every retention strategy.

2. Range Partitioning

MySQL supports range partitioning, which physically divides a table into separate storage segments based on a column value. For analytics, the obvious partition key is the timestamp column.

When you partition the pageviews table by month on server_timestamp, MySQL stores each month's data in a separate physical file. A query with a WHERE server_timestamp BETWEEN '2025-03-01' AND '2025-03-31' clause only reads the March partition — it skips all other months entirely.

Table: pageviews_partitioned

+-------------+   +-------------+   +-------------+   +-------------+
|  p2025_01   |   |  p2025_02   |   |  p2025_03   |   |  p2025_04   |
|  Jan 2025   |   |  Feb 2025   |   |  Mar 2025   |   |  Apr 2025   |
|  ~300K rows |   |  ~280K rows |   |  ~310K rows |   |  ~295K rows |
+-------------+   +-------------+   +-------------+   +-------------+
       |                                  ^
       v                                  |
DROP PARTITION p2025_01;          Query only touches this
(instant removal of 300K rows —   partition for a March
no row-by-row DELETE)             date range

You can verify partition pruning with EXPLAIN:

EXPLAIN SELECT * FROM pageviews_partitioned
WHERE server_timestamp BETWEEN '2025-03-01' AND '2025-03-31';

+----+-------------+------------------------+------------+------+...+
| id | select_type | table                  | partitions | type |...+
+----+-------------+------------------------+------------+------+...+
|  1 | SIMPLE      | pageviews_partitioned  | p2025_03   | ALL  |...+
+----+-------------+------------------------+------------+------+...+

-- "partitions: p2025_03" confirms only one partition is scanned

Important constraint: In MySQL, the partition key must be part of every unique index (including the primary key). This means the primary key must be a composite key like PRIMARY KEY (id, server_timestamp) rather than just PRIMARY KEY (id). This is a trade-off: you lose the ability to look up a row by id alone, but you gain massive performance improvements for time-range queries.
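
For example, once the partitioned table from the next section exists:

-- Prunes to one partition: both primary-key columns are known
SELECT * FROM pageviews_partitioned
WHERE id = 42 AND server_timestamp = '2025-03-14 10:22:01';

-- Cannot prune: id alone tells MySQL nothing about the partition,
-- so the index is probed in every partition
SELECT * FROM pageviews_partitioned WHERE id = 42;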

3. Creating Partitioned Tables

The retention.sql file contains the full DDL. Here is the key structure:

CREATE TABLE IF NOT EXISTS pageviews_partitioned (
    id                INT AUTO_INCREMENT,
    url               VARCHAR(2048) NOT NULL,
    type              ENUM('pageview','event','error','performance') NOT NULL DEFAULT 'pageview',
    user_agent        VARCHAR(512),
    viewport_width    SMALLINT UNSIGNED,
    viewport_height   SMALLINT UNSIGNED,
    referrer          VARCHAR(2048),
    client_timestamp  DATETIME,
    server_timestamp  DATETIME NOT NULL,
    client_ip         VARCHAR(45),
    session_id        VARCHAR(64),
    payload           JSON,
    created_at        TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    PRIMARY KEY (id, server_timestamp),   -- partition key must be in PK
    INDEX idx_pv_url (url(255)),
    INDEX idx_pv_type (type),
    INDEX idx_pv_session_id (session_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
PARTITION BY RANGE (TO_DAYS(server_timestamp)) (
    PARTITION p2025_01 VALUES LESS THAN (TO_DAYS('2025-02-01')),
    PARTITION p2025_02 VALUES LESS THAN (TO_DAYS('2025-03-01')),
    ...
    PARTITION p2025_12 VALUES LESS THAN (TO_DAYS('2026-01-01')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);

The PARTITION BY RANGE (TO_DAYS(server_timestamp)) clause tells MySQL to use the day number of the timestamp for partitioning. Each partition specifies an upper bound with VALUES LESS THAN.

The p_future catch-all partition ensures that rows with future dates are not rejected. When a new month approaches, you reorganize this partition to carve out a new named partition:

-- Before January 2026 starts, add a new partition:
ALTER TABLE pageviews_partitioned REORGANIZE PARTITION p_future INTO (
    PARTITION p2026_01 VALUES LESS THAN (TO_DAYS('2026-02-01')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);

This operation is fast because p_future should contain few or no rows (data for the future has not arrived yet).

Tip: Adding partitioning to an existing table means rebuilding it: ALTER TABLE ... PARTITION BY RANGE rewrites every row, and it fails outright if any unique key (including the primary key) does not contain the partition column. Plan for partitioning from the start if possible.
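
If you must retrofit anyway, a minimal sketch looks like this, assuming an unpartitioned pageviews table whose primary key is currently just (id). Both statements rewrite the whole table, so run them in a maintenance window:

-- Step 1: extend the primary key to include the partition column
ALTER TABLE pageviews
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (id, server_timestamp);

-- Step 2: rebuild the table with the partition scheme
ALTER TABLE pageviews
PARTITION BY RANGE (TO_DAYS(server_timestamp)) (
    PARTITION p2025_01 VALUES LESS THAN (TO_DAYS('2025-02-01')),
    PARTITION p2025_02 VALUES LESS THAN (TO_DAYS('2025-03-01')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);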

4. Summary Aggregation Tables

Partitioning makes date-range queries fast, but dashboards often need high-level trends: total pageviews per day, unique sessions, average load time. Running these aggregations on millions of raw rows every time the dashboard loads is wasteful.

The solution is a summary table — a pre-computed, denormalized table that stores one row per day with the key metrics already calculated:

CREATE TABLE IF NOT EXISTS daily_summary (
    id               INT AUTO_INCREMENT PRIMARY KEY,
    day              DATE NOT NULL UNIQUE,
    total_pageviews  INT UNSIGNED NOT NULL DEFAULT 0,
    unique_sessions  INT UNSIGNED NOT NULL DEFAULT 0,
    avg_load_time_ms INT UNSIGNED,
    total_errors     INT UNSIGNED NOT NULL DEFAULT 0,
    top_page         VARCHAR(2048),
    created_at       TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    -- no separate index on day: the UNIQUE constraint above already creates one
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

A nightly stored procedure populates this table by querying the raw tables and inserting (or updating) the summary row for the previous day. The retention.sql file includes the full aggregate_daily procedure.

Query Type                        Data Source        Notes
Last 30 days trend chart          daily_summary      30 rows, instant response
Year-over-year comparison         daily_summary      365 rows, still fast
Today's live pageview list        pageviews (raw)    Current partition only
Drill-down: specific page today   pageviews (raw)    Filtered by URL + date

This two-tier approach is standard in analytics: the dashboard defaults to summary data for speed, and switches to raw data only when the user drills down into a specific time range or filter.
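
The summary side is what makes the dashboard cheap. The 30-day trend chart, for example, reduces to a query that reads at most 30 rows:

SELECT day, total_pageviews, unique_sessions, avg_load_time_ms
FROM daily_summary
WHERE day >= CURDATE() - INTERVAL 30 DAY
ORDER BY day;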

Upsert pattern: The aggregation procedure uses INSERT ... ON DUPLICATE KEY UPDATE so it can be safely re-run for the same day without creating duplicate rows. If the nightly job fails and is retried, it simply overwrites the existing summary row.
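
A minimal sketch of what such a procedure can look like, given the two schemas above; it fills only the count columns and leaves avg_load_time_ms and top_page (which need their own sub-aggregations) to the full version in retention.sql:

DELIMITER $$

CREATE PROCEDURE aggregate_daily(IN target_day DATE)
BEGIN
    INSERT INTO daily_summary (day, total_pageviews, unique_sessions, total_errors)
    SELECT target_day,
           IFNULL(SUM(type = 'pageview'), 0),   -- booleans sum to a count
           COUNT(DISTINCT session_id),
           IFNULL(SUM(type = 'error'), 0)
    FROM pageviews_partitioned
    WHERE server_timestamp >= target_day                    -- half-open day range:
      AND server_timestamp <  target_day + INTERVAL 1 DAY   -- prunes to one partition
    ON DUPLICATE KEY UPDATE                                 -- safe to re-run for the same day
        total_pageviews = VALUES(total_pageviews),
        unique_sessions = VALUES(unique_sessions),
        total_errors    = VALUES(total_errors);
END$$

DELIMITER ;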

5. Purge Scripts

Once data has been summarized, you do not need the raw rows forever. Purging old partitions is the fastest way to delete large volumes of data — it is an O(1) metadata operation, not a row-by-row delete:

-- Instantly remove all January 2025 data:
ALTER TABLE pageviews_partitioned DROP PARTITION p2025_01;

Compare this to a traditional DELETE:

-- Slow: scans and deletes row by row, generates undo log
DELETE FROM pageviews
WHERE server_timestamp < '2025-02-01';
-- Could take minutes for millions of rows, locks the table

DROP PARTITION removes the entire physical file in one operation. No row scanning, no undo log, no lock contention. This is one of the strongest practical reasons to use partitioning for analytics tables.

For non-partitioned tables (or tables you cannot easily repartition), use a batched delete loop to avoid long-running locks:

-- Delete in batches of 10,000 to avoid long locks
DELETE FROM pageviews
WHERE server_timestamp < DATE_SUB(NOW(), INTERVAL 6 MONTH)
LIMIT 10000;
-- Run in a loop until 0 rows affected
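
One way to drive that loop from inside MySQL is a small stored procedure (a sketch; purge_old_pageviews is a name invented here, not part of retention.sql):

DELIMITER $$

CREATE PROCEDURE purge_old_pageviews()
BEGIN
    REPEAT
        DELETE FROM pageviews
        WHERE server_timestamp < DATE_SUB(NOW(), INTERVAL 6 MONTH)
        LIMIT 10000;                  -- under autocommit, each batch commits on its own
    UNTIL ROW_COUNT() = 0 END REPEAT; -- stop when a pass deletes nothing
END$$

DELIMITER ;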
Before purging: Always verify that the summary aggregation has completed for the time range you are about to drop. If you drop a partition before summarizing it, that data is gone permanently. A safe workflow is: aggregate first, verify the summary row exists, then purge.
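
Sketched for the January 2025 partition, that check is a single query:

-- 1. Every day of January 2025 should have a summary row
SELECT COUNT(*) AS summarized_days
FROM daily_summary
WHERE day BETWEEN '2025-01-01' AND '2025-01-31';

-- 2. Only if the count is 31, purge the raw data
ALTER TABLE pageviews_partitioned DROP PARTITION p2025_01;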

6. Automation

Manual retention management does not scale. You need automated processes for two recurring tasks:

Nightly aggregation — Summarize yesterday's data into the daily_summary table. This should run after midnight when the previous day's data is complete.

Monthly partition management — Add a new partition before the month starts, and optionally drop partitions older than your retention window (e.g., 6 months).

Option A: MySQL EVENT Scheduler

MySQL has a built-in event scheduler that can run stored procedures on a schedule:

-- Enable the event scheduler (usually in my.cnf for persistence)
SET GLOBAL event_scheduler = ON;

-- Nightly aggregation at 2:00 AM
CREATE EVENT IF NOT EXISTS evt_nightly_aggregate
ON SCHEDULE EVERY 1 DAY
STARTS CURRENT_DATE + INTERVAL 1 DAY + INTERVAL 2 HOUR
DO CALL aggregate_daily(DATE_SUB(CURDATE(), INTERVAL 1 DAY));

Option B: External Cron Job

If you prefer to keep scheduling outside the database (easier to monitor and log), use a system cron job:

# /etc/cron.d/analytics-retention

# Nightly aggregation at 2:00 AM
0 2 * * * root mysql -u analytics -p'password' analytics_db \
    -e "CALL aggregate_daily(DATE_SUB(CURDATE(), INTERVAL 1 DAY));"

# Monthly: add new partition on the 25th of each month
0 3 25 * * root mysql -u analytics -p'password' analytics_db \
    -e "ALTER TABLE pageviews_partitioned REORGANIZE PARTITION p_future INTO (...);"

Which approach? The MySQL EVENT scheduler is simpler for a course project — everything stays inside the database. For production systems, external cron (or a job scheduler like systemd timers) is preferred because it integrates with monitoring, logging, and alerting infrastructure.
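
One detail both schedules gloss over: the monthly REORGANIZE needs a partition name and bound that change every month, so in practice the DDL is built dynamically inside a stored procedure. A sketch (manage_partitions is a hypothetical name; error handling omitted):

DELIMITER $$

CREATE PROCEDURE manage_partitions()
BEGIN
    -- First day of next month, e.g. '2026-01-01'
    DECLARE next_month DATE
        DEFAULT DATE_ADD(DATE_FORMAT(CURDATE(), '%Y-%m-01'), INTERVAL 1 MONTH);

    SET @ddl = CONCAT(
        'ALTER TABLE pageviews_partitioned REORGANIZE PARTITION p_future INTO (',
        'PARTITION p', DATE_FORMAT(next_month, '%Y_%m'),
        ' VALUES LESS THAN (', TO_DAYS(DATE_ADD(next_month, INTERVAL 1 MONTH)), '), ',
        'PARTITION p_future VALUES LESS THAN MAXVALUE)');

    PREPARE stmt FROM @ddl;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
END$$

DELIMITER ;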

A complete retention strategy ties all three pieces together:

Nightly (2:00 AM):
  1. CALL aggregate_daily(yesterday)   -- summarize raw data
  2. Verify daily_summary row exists   -- safety check

Monthly (25th, 3:00 AM):
  3. REORGANIZE p_future               -- add next month's partition
  4. DROP PARTITION p_old              -- remove expired month (only if
                                       --   older than retention window)

Result:
  - Raw data: kept for 6 months (configurable)
  - Summary data: kept indefinitely (tiny — 365 rows/year)
  - Queries stay fast: partitions keep table scans bounded