Server-Side Processing

From Raw Beacons to Clean Records

The collector fires JSON payloads at your server. Now what? This phase receives those raw beacons and transforms them into clean, enriched, sessionized records ready for storage. Every beacon must be parsed, validated against an expected schema, enriched with server-side data the client cannot provide (authoritative timestamps, IP addresses, geolocation), linked into sessions, and finally inserted into the database.

Get this phase wrong and your analytics data is unreliable — or worse, an attack vector. Get it right and you have a pipeline that produces trustworthy data at scale.

1. The Server Side of Collection

The browser sends a POST /collect request containing a JSON body. On the server side, that request kicks off a multi-step pipeline before the data ever reaches the database. Each step has a specific job, and if any step fails, the pipeline must handle the failure gracefully without leaking information back to the client.

Browser                                 Server
   │                                       │
   │── POST /collect ─────────────────────>│
   │   Content-Type: application/json      │
   │   { url, timestamp, ... }             │
   │                                       │
   │                              ┌────────▼────────┐
   │                              │   Parse JSON    │
   │                              └────────┬────────┘
   │                              ┌────────▼────────┐
   │                              │    Validate     │
   │                              │   field types   │
   │                              │   + sanitize    │
   │                              └────────┬────────┘
   │                              ┌────────▼────────┐
   │                              │     Enrich      │
   │                              │   timestamp,    │
   │                              │     IP, geo     │
   │                              └────────┬────────┘
   │                              ┌────────▼────────┐
   │                              │   Sessionize    │
   │                              │     link to     │
   │                              │   session_id    │
   │                              └────────┬────────┘
   │                              ┌────────▼────────┐
   │                              │     INSERT      │
   │                              │   into MySQL    │
   │                              └────────┬────────┘
   │                                       │
   │<──────── 204 No Content ──────────────│
   │                                       │

The 204 No Content response is intentional. The server confirms receipt without sending a response body. This is the standard pattern for analytics endpoints — there is nothing useful to return to the client, and a smaller response means less bandwidth and faster completion.

Why 204 and not 200? A 200 OK implies there is a response body worth reading. A 204 No Content explicitly says "I received your data, there is nothing to send back." This is semantically correct for analytics collection, and it suits navigator.sendBeacon(), which never reads the response anyway.
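
Here is a minimal Express sketch of this flow. The pipeline functions (validateBeacon, enrichBeacon, sessionize, storeBeacon) are hypothetical placeholders for the steps covered in the rest of this section, and error handling is deliberately deferred to Section 8:

// Hypothetical endpoint skeleton. validateBeacon, enrichBeacon,
// sessionize, and storeBeacon stand in for the pipeline steps below.
const express = require('express');
const app = express();

app.use(express.json({ limit: '10kb' })); // beacons are small; cap the body size

app.post('/collect', async (req, res) => {
  const beacon = validateBeacon(req.body);  // schema + type checks (Section 4)
  const record = enrichBeacon(beacon, req); // server timestamp, IP, geo (Section 5)
  await sessionize(record);                 // attach session_id (Section 6)
  await storeBeacon(record);                // parameterized INSERT (Section 7)
  res.status(204).end();                    // confirm receipt, no body
});

app.listen(3000);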

2. GET vs POST for Analytics

Analytics data can theoretically be sent via either HTTP method, but the choice has real consequences for security, capacity, and correctness.

Concern        | GET                                                       | POST
---------------|-----------------------------------------------------------|----------------------------------------------------
Data location  | Query string (?key=val)                                   | Request body (JSON)
Size limit     | ~2,000–8,000 characters (browser/server dependent)        | Effectively unlimited (server config)
Visibility     | Visible in server logs, browser history, Referer headers  | Body not logged by default, not in history
HTTP semantics | Retrieve a resource (idempotent, safe)                    | Submit data for processing (correct for analytics)
Caching        | Browsers and CDNs may cache GET responses                 | POST responses are not cached by default
sendBeacon     | Not supported                                             | The only method sendBeacon uses

GET is the legacy approach. The classic tracking pixel pattern (<img src="/track.gif?page=/about">) uses GET because it requires no JavaScript. It still works and has its place (email tracking, noscript fallbacks), but it cannot carry rich payloads.
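
A sketch of that legacy pattern in Express, serving a 1x1 transparent GIF and reading the payload from the query string (logPageview is a hypothetical logging helper):

// Classic tracking pixel: the payload rides in the query string.
const GIF_1x1 = Buffer.from(
  'R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7', // 1x1 transparent GIF
  'base64'
);

app.get('/track.gif', (req, res) => {
  logPageview(req.query.page);          // hypothetical: record the pageview
  res.set('Content-Type', 'image/gif');
  res.set('Cache-Control', 'no-store'); // never cache, or repeat views go missing
  res.send(GIF_1x1);
});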

POST is the modern standard. It supports structured JSON bodies, avoids URL length limits, keeps data out of server access logs, and is semantically correct — you are submitting data, not requesting a resource. navigator.sendBeacon() uses POST exclusively.

When GET still makes sense: Server-log-based analytics (no custom endpoint), tracking pixels in email, and environments where JavaScript is unavailable. For everything else, use POST.

3. CORS Configuration

In a typical analytics setup, the collector script runs on yoursite.com but sends beacons to analytics.yoursite.com or a completely different domain. The browser's Same-Origin Policy blocks these cross-origin requests unless the server explicitly allows them via CORS (Cross-Origin Resource Sharing) headers.

The Headers You Need

Access-Control-Allow-Origin: https://yoursite.com
Access-Control-Allow-Methods: POST, OPTIONS
Access-Control-Allow-Headers: Content-Type
Access-Control-Max-Age: 86400

The Preflight Problem

When the browser sends a POST with Content-Type: application/json, it is a "non-simple" request. The browser automatically sends an OPTIONS preflight request first to ask the server for permission. Your endpoint must handle both:

Browser                                     Server
   │                                           │
   │── OPTIONS /collect ──────────────────────>│
   │   Origin: https://yoursite.com            │
   │   Access-Control-Request-Method: POST     │
   │   Access-Control-Request-Headers:         │
   │     Content-Type                          │
   │                                           │
   │<── 204 No Content ────────────────────────│
   │    Access-Control-Allow-Origin: ...       │
   │    Access-Control-Allow-Methods: ...      │
   │    Access-Control-Max-Age: 86400          │
   │                                           │
   │── POST /collect ─────────────────────────>│
   │   Content-Type: application/json          │
   │   Origin: https://yoursite.com            │
   │   { ... beacon data ... }                 │
   │                                           │
   │<── 204 No Content ────────────────────────│
   │    Access-Control-Allow-Origin: ...       │
   │                                           │

Never use Access-Control-Allow-Origin: * in production. A wildcard allows any website to send data to your endpoint. An attacker could flood your database with fake analytics data from their own page. Always whitelist the specific origins you expect.
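
Handled manually in Express, honoring the whitelist advice above (a sketch; the allowed-origins list is an assumption for your deployment):

// Manual CORS handling with an explicit origin whitelist (no wildcard).
const ALLOWED_ORIGINS = new Set([
  'https://yoursite.com',
  'https://www.yoursite.com',
]);

app.use('/collect', (req, res, next) => {
  const origin = req.headers.origin;
  if (origin && ALLOWED_ORIGINS.has(origin)) {
    res.set('Access-Control-Allow-Origin', origin); // echo the matched origin
    res.set('Access-Control-Allow-Methods', 'POST, OPTIONS');
    res.set('Access-Control-Allow-Headers', 'Content-Type');
    res.set('Access-Control-Max-Age', '86400');     // cache the preflight for a day
    res.set('Vary', 'Origin');                      // keep caches origin-aware
  }
  if (req.method === 'OPTIONS') return res.status(204).end(); // answer the preflight
  next();
});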

sendBeacon and preflights: If you send a beacon with Content-Type: text/plain instead of application/json, the browser treats it as a "simple" request and skips the preflight. This halves the number of requests but means you must parse the body manually on the server. Many production collectors use this trick for performance.
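
A sketch of that trick on both ends (the payload variable is illustrative; if you adopt this, the text parser replaces the express.json() middleware shown earlier):

// Client: a plain string body is sent as text/plain, which counts as
// a "simple" request and skips the preflight entirely.
navigator.sendBeacon('/collect', JSON.stringify(payload));

// Server: accept text/plain and parse the JSON manually.
app.use('/collect', express.text({ type: 'text/plain' }));
app.post('/collect', (req, res) => {
  let body;
  try {
    body = JSON.parse(req.body); // req.body is a raw string here
  } catch (err) {
    // log and discard; see Section 8
  }
  // ... continue the pipeline with body ...
  res.status(204).end();
});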

4. Validation & Sanitization

Analytics beacons arrive as user-supplied input. An attacker can craft any payload they want using curl, browser DevTools, or a script. If you blindly store this data and later display it in a dashboard, you have a stored XSS vulnerability.

The Validation Pipeline

Raw JSON Input
       │
       ▼
┌──────────────────────────────────┐
│ 1. Parse JSON                    │
│    - Reject if not valid JSON    │
│    - Reject if not an object     │
└──────────────┬───────────────────┘
               ▼
┌──────────────────────────────────┐
│ 2. Schema Check                  │
│    - Only allow known fields     │
│    - Reject unexpected keys      │
│    - Check required fields       │
└──────────────┬───────────────────┘
               ▼
┌──────────────────────────────────┐
│ 3. Type Validation               │
│    - url: string, max 2048       │
│    - timestamp: number (epoch)   │
│    - viewport_w: integer > 0     │
│    - event_type: enum whitelist  │
└──────────────┬───────────────────┘
               ▼
┌──────────────────────────────────┐
│ 4. Sanitization                  │
│    - Strip HTML tags from        │
│      all string fields           │
│    - Truncate long strings       │
│    - Normalize URLs              │
└──────────────┬───────────────────┘
               ▼
┌──────────────────────────────────┐
│ 5. Use Parameterized Queries     │
│    - NEVER concatenate strings   │
│      into SQL                    │
│    - Use prepared statements     │
└──────────────────────────────────┘

What to Validate

Field       | Type    | Validation Rule
------------|---------|-----------------------------------------------------
url         | string  | Must be a valid URL, max 2048 chars
referrer    | string  | Valid URL or empty string, max 2048 chars
event_type  | string  | Must be one of: pageview, click, error, performance
timestamp   | number  | Unix epoch in ms, must be within a reasonable range
viewport_w  | integer | Positive integer, max 10000
viewport_h  | integer | Positive integer, max 10000
user_agent  | string  | Max 512 chars, strip HTML

Defense in depth: Validation on the server is not optional, even if the collector validates on the client side. The client is untrusted territory — anyone can bypass your JavaScript and send raw HTTP requests directly to your endpoint.
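
One way to express those rules in Node.js. A sketch only: the field whitelist mirrors the table, a crude tag-stripping regex stands in for a real sanitizer, and user_agent is intentionally absent because the server reads it from the header (Section 5):

// Whitelist validator mirroring the rules in the table above.
const EVENT_TYPES = new Set(['pageview', 'click', 'error', 'performance']);
const ALLOWED_FIELDS = new Set([
  'url', 'referrer', 'event_type', 'timestamp', 'viewport_w', 'viewport_h',
]);

function validateBeacon(body) {
  if (typeof body !== 'object' || body === null || Array.isArray(body)) {
    throw new Error('payload is not an object');
  }
  for (const key of Object.keys(body)) {
    if (!ALLOWED_FIELDS.has(key)) throw new Error(`unexpected field: ${key}`);
  }
  const { url, referrer = '', event_type, timestamp, viewport_w, viewport_h } = body;

  if (typeof url !== 'string' || url.length > 2048) throw new Error('bad url');
  new URL(url); // throws if not a valid URL
  if (typeof referrer !== 'string' || referrer.length > 2048) throw new Error('bad referrer');
  if (referrer !== '') new URL(referrer); // empty string is allowed
  if (!EVENT_TYPES.has(event_type)) throw new Error('bad event_type');
  if (typeof timestamp !== 'number' ||
      Math.abs(Date.now() - timestamp) > 24 * 60 * 60 * 1000) { // "reasonable": within 24h
    throw new Error('bad timestamp');
  }
  for (const v of [viewport_w, viewport_h]) {
    if (!Number.isInteger(v) || v <= 0 || v > 10000) throw new Error('bad viewport');
  }

  // Step 4 (sanitization): strip anything tag-like from string fields.
  const strip = (s) => s.replace(/<[^>]*>/g, '');
  return { url: strip(url), referrer: strip(referrer), event_type, timestamp, viewport_w, viewport_h };
}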

5. Server-Side Enrichment

The client can tell you what page the user is on and what their viewport size is. But some data can only come from the server, and some data the server should override because the client cannot be trusted to provide it accurately.

What the Server Adds

Field            | Source                            | Why Server-Side
-----------------|-----------------------------------|--------------------------------------------------------------------
server_timestamp | Date.now() (Node) / time() (PHP)  | Client clocks can be wrong; server time is authoritative
client_ip        | Request IP / X-Forwarded-For      | Only visible to the server; clients cannot self-report IPs
geo_country      | IP geolocation lookup             | Derived from IP; placeholder until a GeoIP database is integrated
user_hash        | Hash of IP + User-Agent           | Privacy-preserving user identifier without cookies
user_agent       | User-Agent header                 | Server reads the header directly; client-reported value is ignored
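
A sketch of the enrichment step. The X-Forwarded-For handling assumes a trusted reverse proxy in front of the app, and computeUserHash is a hypothetical name for the hashing helper shown below:

// Adds the server-only fields from the table above.
function enrichBeacon(beacon, req) {
  // Behind a trusted proxy, the first X-Forwarded-For entry is the
  // original client; otherwise fall back to the socket address.
  const forwarded = req.headers['x-forwarded-for'];
  const clientIP = forwarded
    ? forwarded.split(',')[0].trim()
    : req.socket.remoteAddress;
  const userAgent = (req.headers['user-agent'] || '').slice(0, 512);

  return {
    ...beacon,
    server_timestamp: Date.now(),  // authoritative server clock
    client_ip: clientIP,
    geo_country: null,             // placeholder until a GeoIP database is integrated
    user_hash: computeUserHash(clientIP, userAgent), // see the hashing code below
    user_agent: userAgent,         // header value; any client-reported value is ignored
  };
}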

Privacy Considerations

Storing raw IP addresses raises privacy concerns (GDPR, CCPA). A common approach is to hash the IP with a daily rotating salt, producing a consistent identifier for sessionization without storing the actual IP:

// Node.js
const crypto = require('crypto');
const dailySalt = getDailySalt(); // rotates every 24 hours
const userHash = crypto
  .createHash('sha256')
  .update(clientIP + userAgent + dailySalt)
  .digest('hex')
  .substring(0, 16);

// PHP
$dailySalt = getDailySalt(); // rotates every 24 hours
$userHash = substr(
    hash('sha256', $clientIP . $userAgent . $dailySalt),
    0, 16
);
Why a daily rotating salt? It prevents long-term tracking while still allowing same-day session linking. After the salt rotates, the same user produces a different hash, making cross-day identification impossible without cookies.
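
One way to implement getDailySalt without storing any state: derive it from a server-side secret plus the current UTC date, so the value changes at midnight (a sketch; SALT_SECRET is an assumed environment variable, and crypto is the module already required above):

// Derives a salt that is stable within a UTC day and rotates at midnight.
function getDailySalt() {
  const day = new Date().toISOString().slice(0, 10); // e.g. "2024-06-01"
  return crypto.createHash('sha256')
    .update(process.env.SALT_SECRET + day)
    .digest('hex');
}

A tradeoff of the derived approach: if SALT_SECRET ever leaks, past salts can be recomputed and old hashes become linkable. Generating a random salt each day and discarding the previous one avoids that, at the cost of storing the current salt somewhere.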

6. Sessionization

A session groups related page views and events from a single user visit. Without sessions, your analytics data is just a flat list of unconnected events. Sessions let you answer questions like "how many pages does a typical user view?" and "where do users enter and exit?"

The 30-Minute Rule

The industry standard (used by Google Analytics, Adobe Analytics, and others) defines a session as ending after 30 minutes of inactivity. If a user views a page, leaves for 25 minutes, and comes back, it is the same session. If they leave for 35 minutes, a new session starts.

User Activity Timeline
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 Page A      Page B          Page C                      Page D    Page E
   │           │               │                           │         │
   ▼           ▼               ▼                           ▼         ▼
───●───────────●───────────────●───────────────────────────●─────────●───
   │   5min    │     10min     │          45min            │  3min   │
   │           │               │          (gap!)           │         │
   └───────────────────────────┘                           └─────────┘
      Session 1 (sid_abc123)                        Session 2 (sid_def456)
    3 pageviews, 15min duration                   2 pageviews, 3min duration

Session Linking Algorithm

  1. Compute user_hash from IP + User-Agent (see Enrichment above)
  2. Query for the most recent event from this user_hash
  3. If found and the last event was less than 30 minutes ago, reuse that session_id
  4. Otherwise, generate a new session_id (UUID v4 or similar)
  5. Store the session_id with the current event
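
A sketch of this algorithm against MySQL, assuming a mysql2/promise pool and an events table with user_hash, session_id, and server_timestamp columns:

const { randomUUID } = require('crypto');
const THIRTY_MINUTES = 30 * 60 * 1000;

async function sessionize(record, pool) {
  // Step 2: find this user's most recent event
  const [rows] = await pool.query(
    `SELECT session_id, server_timestamp FROM events
      WHERE user_hash = ? ORDER BY server_timestamp DESC LIMIT 1`,
    [record.user_hash]
  );
  // Steps 3-4: reuse the session if the gap is under 30 minutes
  if (rows.length > 0 &&
      record.server_timestamp - rows[0].server_timestamp < THIRTY_MINUTES) {
    record.session_id = rows[0].session_id;
  } else {
    record.session_id = randomUUID(); // new session (UUID v4)
  }
  return record; // step 5: session_id is stored with the event on INSERT
}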

First-Touch Attribution

The first event in a session carries special significance: it records the entry page (the landing URL) and the referrer, which together identify the traffic source for the entire session.

Sessions without cookies: Because we use a hashed IP + User-Agent identifier, our sessionization works without setting any cookies. This is privacy-friendly and avoids cookie consent requirements in many jurisdictions. The tradeoff is that users behind shared IPs (corporate networks, university WiFi) may be grouped together.

7. Batch vs Real-Time Ingestion

Once beacons are validated, enriched, and sessionized, they need to be inserted into the database. There are two fundamental strategies, each with distinct tradeoffs.

Real-Time (Insert Per Beacon)

Each incoming beacon triggers an immediate INSERT statement. Data is available in the database within milliseconds of the event occurring.

Advantage                         | Disadvantage
----------------------------------|---------------------------------------
Data available immediately        | One DB connection per beacon
Simple implementation             | Higher DB load under traffic spikes
No data loss from buffer crashes  | Slower individual inserts vs. bulk

Batch (Buffer and Bulk Insert)

Beacons are accumulated in an in-memory buffer (or a file queue) and flushed to the database periodically (e.g., every 5 seconds or every 100 records).

Advantage                          | Disadvantage
-----------------------------------|----------------------------------------
Far fewer DB connections           | Data is delayed by the flush interval
Bulk INSERT is much faster         | Buffer loss on server crash
Handles traffic spikes gracefully  | More complex implementation

Real-Time                       Batch
━━━━━━━━━━━━━━━━━               ━━━━━━━━━━━━━━━━━━━━━
Beacon 1 ──> INSERT             Beacon 1 ──┐
Beacon 2 ──> INSERT             Beacon 2 ──┤
Beacon 3 ──> INSERT             Beacon 3 ──┼──> BULK INSERT
Beacon 4 ──> INSERT             Beacon 4 ──┤    (one statement)
Beacon 5 ──> INSERT             Beacon 5 ──┘

5 statements, 5 connections     1 statement, 1 connection
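
For contrast, a sketch of the batch pattern: an in-memory buffer flushed on a timer or when it fills. The thresholds and the pool variable are illustrative, and the comments mark where a crash loses data:

// Batch ingestion sketch. Only viable in a long-running process
// (Node); per-request PHP would need an external queue instead.
const buffer = [];
const FLUSH_SIZE = 100;
const FLUSH_INTERVAL_MS = 5000;

function enqueue(record) {
  buffer.push(record);
  if (buffer.length >= FLUSH_SIZE) flush(); // size-triggered flush
}

async function flush() {
  if (buffer.length === 0) return;
  const batch = buffer.splice(0, buffer.length); // drained records are lost on a crash here
  const values = batch.map(r => [r.session_id, r.url, r.event_type, r.server_timestamp]);
  // One bulk INSERT instead of one statement per beacon
  await pool.query(
    'INSERT INTO events (session_id, url, event_type, server_timestamp) VALUES ?',
    [values]
  );
}

setInterval(flush, FLUSH_INTERVAL_MS); // time-triggered flush
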
For this project, use real-time insertion. At the traffic volumes you will encounter in a class project, individual inserts are perfectly adequate and far simpler to implement. Batch ingestion is a scaling optimization you would add when insert volume exceeds what a single connection can handle (typically thousands of beacons per second).

8. Error Handling

Your analytics endpoint will encounter errors: malformed JSON, unexpected field types, database connection failures, disk full conditions. The way you handle these errors matters for both reliability and security.

The Golden Rule: Always Return 204

Regardless of what happens internally, your /collect endpoint should return 204 No Content to the client. Never return error details.

Scenario                           | Internal Action                        | Client Response
-----------------------------------|----------------------------------------|----------------
Valid beacon, stored successfully  | INSERT into database                   | 204
Invalid JSON body                  | Log error, discard payload             | 204
Validation failure                 | Log violation, discard payload         | 204
Database connection down           | Log error, queue for retry or discard  | 204
Unknown/unexpected error           | Log stack trace, discard payload       | 204

Why Not Return Error Codes?

Returning a 400 with validation details would hand an attacker a feedback loop: they could probe the endpoint until their forged payloads pass every check, and the error codes would reveal which pipeline step rejected the data. The legitimate client gains nothing from an error either: navigator.sendBeacon() is fire-and-forget and never reads the response.

Log everything server-side. Just because you return 204 to the client does not mean you ignore errors. Log validation failures, database errors, and malformed payloads to a server-side log file or monitoring system. You need this data to debug pipeline issues.
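
The golden rule as code: a sketch wrapping the pipeline from Section 1 so that every outcome converges on 204 while the details go to the server log:

app.post('/collect', async (req, res) => {
  try {
    const beacon = validateBeacon(req.body);
    const record = enrichBeacon(beacon, req);
    await sessionize(record, pool);
    await storeBeacon(record);
  } catch (err) {
    // Log internally with enough context to debug; tell the client nothing.
    console.error(`[collect] ${err.message} ip=${req.ip} ua=${req.headers['user-agent']}`);
  }
  res.status(204).end(); // identical response on every path
});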

9. Dual Implementation: Node.js & PHP

The tutorials in this section provide both a Node.js (Express) and a PHP (PDO) implementation of the server-side processing pipeline. Both cover the same logic: receive, validate, enrich, sessionize, and store analytics beacons.

Aspect        | Node.js (Express)                   | PHP (PDO)
--------------|--------------------------------------|------------------------------------------------
HTTP handling | Express middleware                   | $_SERVER, php://input
JSON parsing  | express.json() middleware            | json_decode(file_get_contents('php://input'))
CORS          | cors package or manual headers       | header() calls
Database      | mysql2/promise                       | PDO with prepared statements
Hashing       | crypto.createHash('sha256')          | hash('sha256', ...)
Deployment    | Persistent process (PM2, systemd)    | Per-request behind Apache/Nginx

You do not need to build both. Choose the language that matches your stack — or build both to compare the execution model differences firsthand. The Node.js endpoint demonstrates the event-loop / long-running process model, while the PHP endpoint demonstrates the start-process-die model. Both are valid architectures for analytics ingestion.

Execution models matter here. In Node.js, your server process stays alive between requests, so you can maintain in-memory buffers, connection pools, and caches. In PHP, each request starts a fresh process (or reuses one via PHP-FPM), so state must come from external stores (database, Redis, files). See the Execution Models overview for a deeper comparison.

Tutorial Modules

01. Node.js Endpoint

Build an Express endpoint that receives, validates, enriches, and stores analytics beacons in MySQL.

02. PHP Endpoint

Build a PHP endpoint using PDO that mirrors the Node.js pipeline with the start-process-die execution model.

03. Validation Deep Dive

Harden your endpoint with schema validation, input sanitization, and defense against stored XSS.

04. Sessionization

Implement session linking with 30-minute timeout, user hashing, and first-touch attribution.
