Analytics data grows fast. A site with moderate traffic can accumulate millions of rows per month. This module covers strategies for managing data growth: range partitioning for fast date queries and easy purging, summary aggregation tables for long-term trends, and automated cleanup scripts.
Consider a site with moderate traffic: 10,000 pageviews per day. Over a full year that is:

- 10,000 rows/day × 365 days ≈ 3.65 million rows in the pageviews table alone
- at a few hundred bytes per row (long URLs, user agents, JSON payloads), a few gigabytes of data plus index overhead
This is just one table. When you factor in custom events, JavaScript error logs, and performance timing entries, total storage can reach 5–10 GB per year even for a moderately trafficked site. A busy site with 100K+ daily pageviews will hit these numbers in a month.
The real cost is not disk space — it is query performance. A SELECT that scans 10 million rows to produce a date-range report becomes painfully slow. Indexes help, but they grow too, consuming memory and slowing writes.
MySQL supports range partitioning, which physically divides a table into separate storage segments based on a column value. For analytics, the obvious partition key is the timestamp column.
When you partition the pageviews table by month on server_timestamp, MySQL stores each month's data in a separate physical file. A query with a WHERE server_timestamp BETWEEN '2025-03-01' AND '2025-03-31 23:59:59' clause only reads the March partition and skips all other months entirely. (Note the explicit time on the upper bound: against a DATETIME column, BETWEEN ... AND '2025-03-31' ends at midnight and silently misses the rest of March 31.)
You can verify partition pruning with EXPLAIN:
EXPLAIN SELECT * FROM pageviews_partitioned
WHERE server_timestamp BETWEEN '2025-03-01' AND '2025-03-31 23:59:59';
+----+-------------+-----------------------+------------+------+...
| id | select_type | table                 | partitions | type |...
+----+-------------+-----------------------+------------+------+...
|  1 | SIMPLE      | pageviews_partitioned | p2025_03   | ALL  |...
+----+-------------+-----------------------+------------+------+...
-- "partitions: p2025_03" confirms only one partition is scanned
Note one structural requirement: on a partitioned table, every unique key, including the primary key, must contain the partitioning column. That is why the table below uses PRIMARY KEY (id, server_timestamp) rather than just PRIMARY KEY (id). This is a trade-off: id is no longer guaranteed unique on its own (AUTO_INCREMENT still produces distinct values in practice), but you gain massive performance improvements for time-range queries.
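To see the trade-off concretely, compare a lookup by id alone with one that also bounds the timestamp (the id value is hypothetical):

-- Searches the index in every partition: MySQL cannot tell
-- which month id 12345 landed in
SELECT * FROM pageviews_partitioned WHERE id = 12345;

-- Prunes to a single partition: the date bound narrows the search
SELECT * FROM pageviews_partitioned
WHERE id = 12345
  AND server_timestamp >= '2025-03-01'
  AND server_timestamp <  '2025-04-01';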
The retention.sql file contains the full DDL. Here is the key structure:
CREATE TABLE IF NOT EXISTS pageviews_partitioned (
    id INT AUTO_INCREMENT,
    url VARCHAR(2048) NOT NULL,
    type ENUM('pageview','event','error','performance') NOT NULL DEFAULT 'pageview',
    user_agent VARCHAR(512),
    viewport_width SMALLINT UNSIGNED,
    viewport_height SMALLINT UNSIGNED,
    referrer VARCHAR(2048),
    client_timestamp DATETIME,
    server_timestamp DATETIME NOT NULL,
    client_ip VARCHAR(45),
    session_id VARCHAR(64),
    payload JSON,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id, server_timestamp), -- partition key must be in PK
    INDEX idx_pv_url (url(255)),
    INDEX idx_pv_type (type),
    INDEX idx_pv_session_id (session_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
PARTITION BY RANGE (TO_DAYS(server_timestamp)) (
    PARTITION p2025_01 VALUES LESS THAN (TO_DAYS('2025-02-01')),
    PARTITION p2025_02 VALUES LESS THAN (TO_DAYS('2025-03-01')),
    ...
    PARTITION p2025_12 VALUES LESS THAN (TO_DAYS('2026-01-01')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);
The PARTITION BY RANGE (TO_DAYS(server_timestamp)) clause tells MySQL to use the day number of the timestamp for partitioning. Each partition specifies an upper bound with VALUES LESS THAN.
The p_future catch-all partition ensures that rows with future dates are not rejected. When a new month approaches, you reorganize this partition to carve out a new named partition:
-- Before January 2026 starts, add a new partition:
ALTER TABLE pageviews_partitioned REORGANIZE PARTITION p_future INTO (
    PARTITION p2026_01 VALUES LESS THAN (TO_DAYS('2026-02-01')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);
This operation is fast because p_future should contain few or no rows (data for the future has not arrived yet).
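You can check the partition layout and approximate row counts at any time through information_schema, which is useful before and after a REORGANIZE:

-- List partitions with their estimated row counts
SELECT PARTITION_NAME, TABLE_ROWS
FROM information_schema.PARTITIONS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'pageviews_partitioned'
ORDER BY PARTITION_ORDINAL_POSITION;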
One caveat: you cannot simply add PARTITION BY RANGE to an existing table if the primary key does not already include the partition column; MySQL rejects the ALTER until the primary key is rebuilt to include it. Plan for partitioning from the start if possible.
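If you do need to migrate an existing pageviews table, the usual route is copy-and-swap: create the partitioned table from the DDL above, copy the rows across, then rename. A minimal sketch (a large table would be copied in date-bounded batches rather than one statement):

-- Copy existing rows into the partitioned table
INSERT INTO pageviews_partitioned
    (url, type, user_agent, viewport_width, viewport_height, referrer,
     client_timestamp, server_timestamp, client_ip, session_id, payload)
SELECT url, type, user_agent, viewport_width, viewport_height, referrer,
       client_timestamp, server_timestamp, client_ip, session_id, payload
FROM pageviews;

-- Swap the tables in a single atomic operation
RENAME TABLE pageviews TO pageviews_old,
             pageviews_partitioned TO pageviews;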
Partitioning makes date-range queries fast, but dashboards often need high-level trends: total pageviews per day, unique sessions, average load time. Running these aggregations on millions of raw rows every time the dashboard loads is wasteful.
The solution is a summary table — a pre-computed, denormalized table that stores one row per day with the key metrics already calculated:
CREATE TABLE IF NOT EXISTS daily_summary (
    id INT AUTO_INCREMENT PRIMARY KEY,
    day DATE NOT NULL UNIQUE, -- the UNIQUE constraint doubles as the lookup index on day
    total_pageviews INT UNSIGNED NOT NULL DEFAULT 0,
    unique_sessions INT UNSIGNED NOT NULL DEFAULT 0,
    avg_load_time_ms INT UNSIGNED,
    total_errors INT UNSIGNED NOT NULL DEFAULT 0,
    top_page VARCHAR(2048),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
A nightly stored procedure populates this table by querying the raw tables and inserting (or updating) the summary row for the previous day. The retention.sql file includes the full aggregate_daily procedure.
| Query Type | Data Source | Notes |
|---|---|---|
| Last 30 days trend chart | daily_summary | 30 rows, instant response |
| Year-over-year comparison | daily_summary | 365 rows, still fast |
| Today's live pageview list | pageviews (raw) | Current partition only |
| Drill-down: specific page today | pageviews (raw) | Filtered by URL + date |
This two-tier approach is standard in analytics: the dashboard defaults to summary data for speed, and switches to raw data only when the user drills down into a specific time range or filter.
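For example, the 30-day trend chart needs nothing more than:

-- Dashboard trend: 30 summary rows instead of millions of raw rows
SELECT day, total_pageviews, unique_sessions
FROM daily_summary
WHERE day >= CURDATE() - INTERVAL 30 DAY
ORDER BY day;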
The procedure writes with INSERT ... ON DUPLICATE KEY UPDATE so it can be safely re-run for the same day without creating duplicate rows (the UNIQUE constraint on day supplies the key). If the nightly job fails and is retried, it simply overwrites the existing summary row.
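The full procedure is in retention.sql; a minimal sketch of the pattern looks like this (the payload key $.load_time_ms is an assumption about how performance entries store timings, so adjust it to your actual payload shape):

DELIMITER //
CREATE PROCEDURE aggregate_daily(IN target_day DATE)
BEGIN
    INSERT INTO daily_summary
        (day, total_pageviews, unique_sessions, avg_load_time_ms, total_errors)
    SELECT
        target_day,
        COALESCE(SUM(type = 'pageview'), 0),   -- boolean sum = pageview count
        COUNT(DISTINCT session_id),
        AVG(CASE WHEN type = 'performance'     -- assumed payload key
                 THEN CAST(payload->>'$.load_time_ms' AS UNSIGNED) END),
        COALESCE(SUM(type = 'error'), 0)
    FROM pageviews_partitioned
    WHERE server_timestamp >= target_day
      AND server_timestamp <  target_day + INTERVAL 1 DAY
    ON DUPLICATE KEY UPDATE
        total_pageviews  = VALUES(total_pageviews),
        unique_sessions  = VALUES(unique_sessions),
        avg_load_time_ms = VALUES(avg_load_time_ms),
        total_errors     = VALUES(total_errors);
END //
DELIMITER ;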
Once data has been summarized, you do not need the raw rows forever. Purging old partitions is the fastest way to delete large volumes of data — it is an O(1) metadata operation, not a row-by-row delete:
-- Instantly remove all January 2025 data:
ALTER TABLE pageviews_partitioned DROP PARTITION p2025_01;
Compare this to a traditional DELETE:
-- Slow: scans and deletes row by row, generates undo log
DELETE FROM pageviews
WHERE server_timestamp < '2025-02-01';
-- Could take minutes for millions of rows, locks the table
DROP PARTITION removes the entire physical file in one operation. No row scanning, no undo log, no lock contention. This is one of the strongest practical reasons to use partitioning for analytics tables.
For non-partitioned tables (or tables you cannot easily repartition), use a batched delete loop to avoid long-running locks:
-- Delete in batches of 10,000 to avoid long locks
DELETE FROM pageviews
WHERE server_timestamp < DATE_SUB(NOW(), INTERVAL 6 MONTH)
LIMIT 10000;
-- Run in a loop until 0 rows affected
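Plain SQL has no loop construct for this, so in practice you wrap the batch in a small stored procedure (the name here is hypothetical) or a shell loop:

DELIMITER //
CREATE PROCEDURE purge_old_pageviews()
BEGIN
    DECLARE rows_deleted INT DEFAULT 1;
    WHILE rows_deleted > 0 DO
        DELETE FROM pageviews
        WHERE server_timestamp < DATE_SUB(NOW(), INTERVAL 6 MONTH)
        LIMIT 10000;
        SET rows_deleted = ROW_COUNT(); -- rows removed by the DELETE above
    END WHILE;
END //
DELIMITER ;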
Manual retention management does not scale. You need automated processes for two recurring tasks:
1. Nightly aggregation: summarize yesterday's data into the daily_summary table. This should run after midnight, when the previous day's data is complete.
2. Monthly partition management: add a new partition before the month starts, and optionally drop partitions older than your retention window (e.g., 6 months).
MySQL has a built-in event scheduler that can run stored procedures on a schedule:
-- Enable the event scheduler (usually in my.cnf for persistence)
SET GLOBAL event_scheduler = ON;
-- Nightly aggregation at 2:00 AM
CREATE EVENT IF NOT EXISTS evt_nightly_aggregate
    ON SCHEDULE EVERY 1 DAY
    STARTS CURRENT_DATE + INTERVAL 1 DAY + INTERVAL 2 HOUR
    DO CALL aggregate_daily(DATE_SUB(CURDATE(), INTERVAL 1 DAY));
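A quick check that the scheduler is running and the event was registered:

-- Verify scheduler state and event registration
SHOW VARIABLES LIKE 'event_scheduler';
SHOW EVENTS WHERE Name = 'evt_nightly_aggregate';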
If you prefer to keep scheduling outside the database (easier to monitor and log), use a system cron job:
# /etc/cron.d/analytics-retention
# Credentials are inlined below for brevity; in production, keep them in an
# option file (e.g., mysql --defaults-extra-file=...) rather than the crontab.
# Nightly aggregation at 2:00 AM
0 2 * * * root mysql -u analytics -p'password' analytics_db \
-e "CALL aggregate_daily(DATE_SUB(CURDATE(), INTERVAL 1 DAY));"
# Monthly: add new partition on the 25th of each month
0 3 25 * * root mysql -u analytics -p'password' analytics_db \
-e "ALTER TABLE pageviews_partitioned REORGANIZE PARTITION p_future INTO (...);"
A complete retention strategy ties all three pieces together: