Module 03: Validation & Sanitization

Analytics endpoints are public — anyone can POST data to them. Without validation, your database becomes a vector for stored XSS, SQL injection, and garbage data. This module builds reusable validation middleware for both Node.js and PHP.

Demo Files

validate.js — Node.js validation middleware
validate.php — PHP validation function library

1. Analytics Data as Attack Vector

Your analytics endpoint accepts untrusted input and stores it. Later, your dashboard reads that data and renders it in HTML. This creates a classic stored XSS pipeline: an attacker sends a <script> tag inside a URL or user-agent field, your endpoint saves it as-is, and your dashboard injects it into the page without escaping.

 Attacker                 Endpoint               Database               Dashboard
┌──────────┐             ┌──────────┐           ┌──────────┐           ┌──────────┐
│ POST     │             │          │           │          │           │          │
│ url:     │────────────>│ Store    │──────────>│ <script> │──────────>│ Renders  │
│ <script> │             │ as-is?   │           │ saved    │           │ XSS!     │
└──────────┘             └──────────┘           └──────────┘           └──────────┘

The fix is to validate and sanitize at ingest time, before data ever reaches storage. Here are the common attack vectors for analytics fields:

Field	Attack Type	Mitigation
`url`	Stored XSS via `<script>` in URL string	Validate URL format, HTML-encode entities
`referrer`	Stored XSS, open redirect payloads	Validate URL format, HTML-encode entities
`userAgent`	Stored XSS via crafted UA string	HTML-encode, truncate to max length
`sessionId`	SQL injection, NoSQL injection	Restrict to alphanumeric + hyphens only
`type`	Arbitrary string injection	Allowlist of known values
`viewportWidth/Height`	Type confusion, overflow	Parse as integer, clamp to range
`timestamp`	Format confusion, injection via date strings	Validate ISO 8601 format

Never trust client data. Even fields that look harmless (like viewportWidth) can carry unexpected types. A number field set to the string "999999999999999999999" loses precision or overflows once it is coerced to a number. A payload that contains a key named "__proto__" can trigger prototype pollution in JavaScript code that copies properties carelessly.
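A minimal sketch of type confusion and prototype pollution, using only built-in JavaScript behavior. The names isPlainCount, incoming, and target are hypothetical and are not part of validate.js.

```javascript
// Pitfall 1: a numeric-looking string silently loses precision when coerced.
const huge = Number('999999999999999999999');
console.log(Number.isSafeInteger(huge)); // false: beyond 2^53 - 1

// Defensive check: accept only real numbers that are non-negative safe integers.
// (isPlainCount is a hypothetical helper for illustration.)
function isPlainCount(value) {
    return typeof value === 'number' && Number.isSafeInteger(value) && value >= 0;
}
console.log(isPlainCount('1920')); // false: a string, not a number

// Pitfall 2: JSON.parse creates an own property literally named "__proto__".
// Copying keys one by one into a plain object then swaps that object's prototype.
const incoming = JSON.parse('{"__proto__": {"isAdmin": true}}');
const target = {};
for (const key of Object.keys(incoming)) {
    target[key] = incoming[key]; // target.__proto__ = {...} runs the prototype setter
}
console.log(target.isAdmin); // true, although target has no own isAdmin property
```

Reading a fixed list of known field names, instead of copying whatever keys arrive, avoids the second pitfall entirely.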

2. Field-by-Field Validation

Every field in your analytics payload needs a specific validation rule. Here is the complete specification for the beacon format used in this tutorial series:

Field	Type	Rule	Example
`url`	string	Valid URL format (`http://` or `https://`), max 2048 chars	`https://example.com/about`
`type`	string	Enum allowlist: `pageview`, `event`, `error`, `performance`	`pageview`
`timestamp`	string	ISO 8601 format	`2026-02-17T10:30:00.000Z`
`viewportWidth`	number	Integer, range 0–32767	`1920`
`viewportHeight`	number	Integer, range 0–32767	`1080`
`userAgent`	string	Max 512 chars	`Mozilla/5.0 (Windows NT 10.0; Win64; x64)...`
`referrer`	string	Valid URL format, max 2048 chars	`https://google.com/search?q=example`
`sessionId`	string	Alphanumeric + hyphens/underscores, max 64 chars	`a1b2c3d4-e5f6-7890`

Why 2048 for URLs? The HTTP spec does not define a maximum URL length. 2048 characters is a conservative, widely used ceiling that traces back to Internet Explorer, which capped URLs at roughly 2,083 characters. Modern browsers accept far longer URLs, but a legitimate page URL rarely comes close, so anything longer is most likely malicious or malformed.
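One way to keep the specification table and the code in sync is to express each row as a predicate. The sketch below is an assumption about how that could look, not the structure of validate.js: RULES, findViolations, and the regular expressions are hypothetical, the timestamp pattern checks shape only, and this variant reports violations instead of substituting defaults.

```javascript
// Hypothetical rule table mirroring the field specification.
const TYPES = ['pageview', 'event', 'error', 'performance'];
const ID_PATTERN = /^[A-Za-z0-9_-]{1,64}$/;
// Shape-only check for timestamps like 2026-02-17T10:30:00.000Z
const ISO_PATTERN = /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$/;

const isHttpUrl = (v) =>
    typeof v === 'string' && v.length <= 2048 && /^https?:\/\/\S+$/.test(v);
const isViewport = (v) => Number.isInteger(v) && v >= 0 && v <= 32767;

const RULES = {
    url: isHttpUrl,
    type: (v) => TYPES.includes(v),
    timestamp: (v) => typeof v === 'string' && ISO_PATTERN.test(v),
    viewportWidth: isViewport,
    viewportHeight: isViewport,
    userAgent: (v) => typeof v === 'string' && v.length <= 512,
    referrer: isHttpUrl,
    sessionId: (v) => typeof v === 'string' && ID_PATTERN.test(v),
};

// Returns the names of fields that are present but break their rule.
function findViolations(beacon) {
    return Object.keys(RULES).filter(
        (field) => beacon[field] !== undefined && !RULES[field](beacon[field])
    );
}

console.log(findViolations({
    url: 'https://example.com/about',
    type: 'click',
    viewportWidth: 40000,
}));
// → [ 'type', 'viewportWidth' ]
```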

3. Sanitization

Validation rejects bad data. Sanitization cleans data that passes validation but might still contain dangerous characters. The three core sanitization operations:

HTML Entity Encoding

Replace characters that have special meaning in HTML with their entity equivalents. This prevents stored XSS when the data is later rendered in a dashboard:

Character    Entity        Why
&            &amp;         Starts an HTML entity
<            &lt;          Opens an HTML tag
>            &gt;          Closes an HTML tag
"            &quot;        Breaks out of attribute values
'            &#39;         Breaks out of single-quoted attributes

After encoding, <script>alert(1)</script> becomes the harmless string &lt;script&gt;alert(1)&lt;/script&gt; that renders as visible text, not executable code.
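The entity table translates directly into a lookup map. This is a minimal sketch; encodeHtml and HTML_ENTITIES are illustrative names chosen here.

```javascript
// The five characters from the table and their entity replacements.
const HTML_ENTITIES = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    '"': '&quot;',
    "'": '&#39;',
};

function encodeHtml(str) {
    // A single regex pass, so an "&" produced by this call is never re-encoded.
    return String(str).replace(/[&<>"']/g, (ch) => HTML_ENTITIES[ch]);
}

console.log(encodeHtml('<script>alert(1)</script>'));
// → &lt;script&gt;alert(1)&lt;/script&gt;

// Encoding is not idempotent: applying it twice double-encodes the ampersand.
console.log(encodeHtml(encodeHtml('Tom & Jerry')));
// → Tom &amp;amp; Jerry
```

The second call shows why data should be encoded exactly once on its way to the page: encoding at ingest and again at render time produces visible &amp; sequences.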

Stripping Control Characters

Control characters (ASCII 0x00 through 0x1F and 0x7F) have no place in analytics data. They can break parsers, corrupt log files, and serve as payloads for certain injection attacks:

// JavaScript: strip control characters
str.replace(/[\x00-\x1f\x7f]/g, '');

// PHP: strip control characters
preg_replace('/[\x00-\x1f\x7f]/', '', $str);

Truncating to Max Lengths

Even after encoding, enforce maximum lengths. A 1 MB URL string can pass a format-only URL check but should not be stored:

// JavaScript
str.substring(0, maxLen);

// PHP (multibyte-safe)
mb_substr($str, 0, $maxLen, 'UTF-8');

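Order matters. Always truncate before encoding. If you encode first and then truncate, you might cut an entity in half and store &amp instead of &amp;, which corrupts the output. Because encoding runs last, the stored string can be longer than the truncation limit, so size your storage for the encoded length.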
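A small sketch of the ordering problem. The input value and the limit of 10 are made up for illustration, and encodeHtml is a compact stand-in for the encoder.

```javascript
// Compact encoder for the five HTML-significant characters.
function encodeHtml(str) {
    const map = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };
    return str.replace(/[&<>"']/g, (ch) => map[ch]);
}

const input = 'aaaaaa&tail'; // hypothetical field value
const maxLen = 10;

const truncateThenEncode = encodeHtml(input.substring(0, maxLen));
const encodeThenTruncate = encodeHtml(input).substring(0, maxLen);

console.log(truncateThenEncode); // aaaaaa&amp;tai  (entity intact, 14 chars)
console.log(encodeThenTruncate); // aaaaaa&amp      (entity cut, missing ";")
```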

4. Node.js Implementation

The validate.js module exports a single function: validateBeacon(data). It takes a raw parsed JSON object and returns either a sanitized object or null if the data is invalid.

const fs = require('fs');
const { validateBeacon } = require('./validate');

// In your Express handler (req.body must already be parsed, e.g. by express.json()):
app.post('/collect', (req, res) => {
    const clean = validateBeacon(req.body);
    if (!clean) {
        return res.status(400).json({ error: 'Invalid beacon data' });
    }
    // clean is safe to store
    fs.appendFile(LOG_FILE, JSON.stringify(clean) + '\n', (err) => {
        if (err) return res.sendStatus(500);
        res.sendStatus(204);
    });
});

Walk through the key functions in validate.js:

`validateBeacon(data)`

The main entry point. Returns null immediately if data is not an object. Checks that url is present and matches the URL pattern. Then builds a clean output object by running each field through its specific validator:

function validateBeacon(data) {
    if (!data || typeof data !== 'object') return null;

    const allowedTypes = ['pageview', 'event', 'error', 'performance'];
    const urlPattern = /^https?:\/\/.{1,2048}$/;

    // URL is the only required field
    if (typeof data.url !== 'string' || !urlPattern.test(data.url)) return null;

    return {
        url: sanitize(data.url, 2048),
        type: allowedTypes.includes(data.type) ? data.type : 'pageview',
        userAgent: sanitize(data.userAgent || '', 512),
        viewportWidth: clampInt(data.viewportWidth, 0, 32767),
        viewportHeight: clampInt(data.viewportHeight, 0, 32767),
        referrer: sanitize(data.referrer || '', 2048),
        timestamp: isISO8601(data.timestamp) ? data.timestamp : null,
        sessionId: sanitizeId(data.sessionId || '', 64),
        // payload is passed through unvalidated: escape it before rendering
        payload: data.payload || null,
    };
}

`sanitize(str, maxLen)`

Truncates, HTML-encodes, and strips control characters in one chained expression:

function sanitize(str, maxLen) {
    return String(str)
        .substring(0, maxLen)
        .replace(/[<>&"']/g, c => ({
            '<':'&lt;', '>':'&gt;', '&':'&amp;',
            '"':'&quot;', "'":'&#39;'
        }[c]))
        .replace(/[\x00-\x1f\x7f]/g, '');
}

`clampInt(val, min, max)`

Parses a value as an integer and clamps it to a range. Returns null if parseInt cannot extract a number. Note that parseInt is lenient: "1920px" parses as 1920 and 1080.9 as 1080:

function clampInt(val, min, max) {
    const n = parseInt(val, 10);
    if (isNaN(n)) return null;
    return Math.max(min, Math.min(max, n));
}
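If partially numeric input should be rejected rather than repaired, a stricter variant is possible. The sketch below is an assumption about one way to do that; strictClampInt is a hypothetical name, and clampInt is repeated only so the comparison runs on its own.

```javascript
// clampInt as shown above, repeated so this sketch is self-contained.
function clampInt(val, min, max) {
    const n = parseInt(val, 10);
    if (isNaN(n)) return null;
    return Math.max(min, Math.min(max, n));
}

// Hypothetical stricter variant: accepts integers and all-digit strings only.
function strictClampInt(val, min, max) {
    const n = typeof val === 'string' && /^-?\d+$/.test(val.trim()) ? Number(val) : val;
    if (typeof n !== 'number' || !Number.isInteger(n)) return null;
    return Math.max(min, Math.min(max, n));
}

console.log(clampInt('1920px', 0, 32767));          // 1920
console.log(clampInt(1080.9, 0, 32767));            // 1080
console.log(strictClampInt('1920px', 0, 32767));    // null
console.log(strictClampInt(1080.9, 0, 32767));      // null
console.log(strictClampInt('1920', 0, 32767));      // 1920
console.log(strictClampInt(99999999999, 0, 32767)); // 32767
```

Which behavior is preferable depends on whether you would rather keep a slightly malformed beacon or drop the field.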

`sanitizeId(str, maxLen)`

Strips everything except alphanumeric characters, hyphens, and underscores. This is the strictest sanitizer — session IDs should never contain HTML, SQL, or shell metacharacters:

function sanitizeId(str, maxLen) {
    return String(str).substring(0, maxLen).replace(/[^a-zA-Z0-9\-_]/g, '');
}
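validateBeacon also calls isISO8601, which is not shown in this walkthrough. The following is a hypothetical sketch of what such a helper could look like, not the code from validate.js. It assumes timestamps in the extended format with a time zone designator, as in the example 2026-02-17T10:30:00.000Z.

```javascript
// Hypothetical implementation; the real helper in validate.js may differ.
const ISO_8601 = /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d{1,3})?(?:Z|[+-]\d{2}:\d{2})$/;

function isISO8601(value) {
    // Reject non-strings first so numbers and objects never reach the regex.
    if (typeof value !== 'string' || !ISO_8601.test(value)) return false;
    // The regex checks shape only; Date.parse catches out-of-range
    // components such as month 13.
    return !Number.isNaN(Date.parse(value));
}

console.log(isISO8601('2026-02-17T10:30:00.000Z')); // true
console.log(isISO8601('2026-13-01T00:00:00Z'));     // false
console.log(isISO8601('<script>'));                 // false
console.log(isISO8601(1771324200000));              // false
```

Because the pattern admits only digits and a few punctuation characters, a timestamp that passes needs no further HTML encoding.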

5. PHP Implementation

The validate.php library mirrors the interface and rules of the Node.js version. PHP has built-in functions that make validation more concise in some cases:

require_once 'validate.php';

// In your endpoint:
$raw = json_decode(file_get_contents('php://input'), true);
$clean = validateBeacon($raw ?? []);
if ($clean === null) {
    http_response_code(400);
    echo json_encode(['error' => 'Invalid beacon data']);
    exit;
}
// $clean is safe to store

Key differences from the Node.js version:

Operation	Node.js	PHP
URL validation	Regex pattern	`filter_var($url, FILTER_VALIDATE_URL)`
HTML encoding	Manual `.replace()`	`htmlspecialchars()`
Integer validation	`parseInt()` + `isNaN()`	`filter_var($val, FILTER_VALIDATE_INT)`
String truncation	`.substring()`	`mb_substr()` (multibyte-safe)
Regex	`/pattern/.test(str)`	`preg_match('/pattern/', $str)`

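PHP's filter_var() with FILTER_VALIDATE_URL is stricter than a simple regex: it checks that the URL has a well-formed scheme and host. It does not restrict the scheme, however, so URLs such as ftp:// or mailto: also pass, and you should still check for http:// or https:// explicitly. PHP's htmlspecialchars() with ENT_QUOTES | ENT_HTML5 handles all five dangerous characters in one call, whereas the Node.js version uses a manual replacement map.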
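Node.js can get closer to parser-based URL validation by using the built-in WHATWG URL class instead of a regex. This is a sketch of an alternative, not what validate.js does; isHttpUrl is a hypothetical helper name.

```javascript
// Hypothetical helper: parser-based URL check for Node.js.
function isHttpUrl(value, maxLen = 2048) {
    if (typeof value !== 'string' || value.length > maxLen) return false;
    let parsed;
    try {
        parsed = new URL(value); // throws TypeError on unparseable input
    } catch {
        return false;
    }
    // Allow only web schemes and require a host.
    const webScheme = parsed.protocol === 'http:' || parsed.protocol === 'https:';
    return webScheme && parsed.hostname !== '';
}

console.log(isHttpUrl('https://example.com/about')); // true
console.log(isHttpUrl('javascript:alert(1)'));       // false
console.log(isHttpUrl('ftp://example.com/file'));    // false
console.log(isHttpUrl('https://'));                  // false
```

The explicit scheme check matters in both languages, since a generic URL parser accepts schemes you would never want to render as a link.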

6. Testing Malicious Input

Your validation code is only as good as the payloads you test it against. Here are examples of malicious inputs and what each implementation should do with them:

Stored XSS in URL

{
    "url": "https://example.com/<script>alert('xss')</script>",
    "type": "pageview"
}

Result: The url field is sanitized. The < and > characters are encoded to &lt; and &gt;, and the single quotes to &#39;. The script tag becomes inert text.

SQL Injection in Session ID

{
    "url": "https://example.com/",
    "type": "pageview",
    "sessionId": "'; DROP TABLE beacons; --"
}

Result: The sanitizeId function strips everything except [a-zA-Z0-9\-_]. Hyphens are allowed, so the output is DROPTABLEbeacons--, which is harmless gibberish.

Integer Overflow

{
    "url": "https://example.com/",
    "type": "pageview",
    "viewportWidth": 99999999999
}

Result: clampInt clamps the value to the maximum of 32767. Absurd values are silently normalized.

Invalid Type with Fallback

{
    "url": "https://example.com/",
    "type": "malicious_custom_type"
}

Result: The type is not in the allowlist, so it defaults to "pageview". Unknown types are never stored.

Control Characters

{
    "url": "https://example.com/",
    "type": "pageview",
    "userAgent": "Mozilla/5.0\u0000\u0001\u0002 injected"
}

Result: Control characters (0x00 through 0x1F, plus 0x7F) are stripped. The output is Mozilla/5.0 injected.

Missing Required Field

{
    "type": "pageview",
    "viewportWidth": 1920
}

Result: validateBeacon returns null. The endpoint responds with 400 Bad Request. No data is stored.

Test with real attack payloads. Resources like the OWASP XSS Filter Evasion Cheat Sheet contain hundreds of XSS vectors. Run them through your validator to make sure nothing gets through. Automated tools like sqlmap and Burp Suite can also test your endpoint for injection vulnerabilities.
