Server-Side Processing

From Raw Beacons to Clean Records

The collector fires JSON payloads at your server. Now what? This phase receives those raw beacons and transforms them into clean, enriched, sessionized records ready for storage. Every beacon must be parsed, validated against an expected schema, enriched with server-side data the client cannot provide (authoritative timestamps, IP addresses, geolocation), linked into sessions, and finally inserted into the database.

Get this phase wrong and your analytics data is unreliable — or worse, an attack vector. Get it right and you have a pipeline that produces trustworthy data at scale.

1. The Server Side of Collection

The browser sends a POST /collect request containing a JSON body. On the server side, that request kicks off a multi-step pipeline before the data ever reaches the database. Each step has a specific job, and if any step fails, the pipeline must handle the failure gracefully without leaking information back to the client.

Browser                                 Server
   │                                       │
   │── POST /collect ─────────────────────>│
   │   Content-Type: application/json      │
   │   { url, timestamp, ... }             │
   │                                       │
   │                              ┌────────▼────────┐
   │                              │   Parse JSON    │
   │                              └────────┬────────┘
   │                              ┌────────▼────────┐
   │                              │    Validate     │
   │                              │   field types   │
   │                              │   + sanitize    │
   │                              └────────┬────────┘
   │                              ┌────────▼────────┐
   │                              │     Enrich      │
   │                              │   timestamp,    │
   │                              │     IP, geo     │
   │                              └────────┬────────┘
   │                              ┌────────▼────────┐
   │                              │   Sessionize    │
   │                              │     link to     │
   │                              │   session_id    │
   │                              └────────┬────────┘
   │                              ┌────────▼────────┐
   │                              │     INSERT      │
   │                              │   into MySQL    │
   │                              └────────┬────────┘
   │                                       │
   │<──────── 204 No Content ──────────────│
   │                                       │

The 204 No Content response is intentional. The server confirms receipt without sending a response body. This is the standard pattern for analytics endpoints — there is nothing useful to return to the client, and a smaller response means less bandwidth and faster completion.

Why 204 and not 200? A 200 OK implies there is a response body worth reading. A 204 No Content explicitly says "I received your data, there is nothing to send back." This is semantically correct for analytics collection, and it suits navigator.sendBeacon(), which never reads the response anyway.
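
Here is a minimal Express sketch of this flow. The pipeline functions (validateBeacon, enrichBeacon, sessionize, storeBeacon) are hypothetical placeholders for the steps covered in the rest of this section, and error handling is deliberately deferred to Section 8:

// Hypothetical endpoint skeleton. validateBeacon, enrichBeacon,
// sessionize, and storeBeacon stand in for the pipeline steps below.
const express = require('express');
const app = express();

app.use(express.json({ limit: '10kb' })); // beacons are small; cap the body size

app.post('/collect', async (req, res) => {
  const beacon = validateBeacon(req.body);  // schema + type checks (Section 4)
  const record = enrichBeacon(beacon, req); // server timestamp, IP, geo (Section 5)
  await sessionize(record);                 // attach session_id (Section 6)
  await storeBeacon(record);                // parameterized INSERT (Section 7)
  res.status(204).end();                    // confirm receipt, no body
});

app.listen(3000);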

2. GET vs POST for Analytics

Analytics data can theoretically be sent via either HTTP method, but the choice has real consequences for security, capacity, and correctness.

Concern        | GET                                                       | POST
---------------|-----------------------------------------------------------|----------------------------------------------------
Data location  | Query string (?key=val)                                   | Request body (JSON)
Size limit     | ~2,000–8,000 characters (browser/server dependent)        | Effectively unlimited (server config)
Visibility     | Visible in server logs, browser history, Referer headers  | Body not logged by default, not in history
HTTP semantics | Retrieve a resource (idempotent, safe)                    | Submit data for processing (correct for analytics)
Caching        | Browsers and CDNs may cache GET responses                 | POST responses are not cached by default
sendBeacon     | Not supported                                             | The only method sendBeacon uses

GET is the legacy approach. The classic tracking pixel pattern (<img src="/track.gif?page=/about">) uses GET because it requires no JavaScript. It still works and has its place (email tracking, noscript fallbacks), but it cannot carry rich payloads.
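
A sketch of that legacy pattern in Express, serving a 1x1 transparent GIF and reading the payload from the query string (logPageview is a hypothetical logging helper):

// Classic tracking pixel: the payload rides in the query string.
const GIF_1x1 = Buffer.from(
  'R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7', // 1x1 transparent GIF
  'base64'
);

app.get('/track.gif', (req, res) => {
  logPageview(req.query.page);          // hypothetical: record the pageview
  res.set('Content-Type', 'image/gif');
  res.set('Cache-Control', 'no-store'); // never cache, or repeat views go missing
  res.send(GIF_1x1);
});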

POST is the modern standard. It supports structured JSON bodies, avoids URL length limits, keeps data out of server access logs, and is semantically correct — you are submitting data, not requesting a resource. navigator.sendBeacon() uses POST exclusively.

When GET still makes sense: Server-log-based analytics (no custom endpoint), tracking pixels in email, and environments where JavaScript is unavailable. For everything else, use POST.

3. CORS Configuration

In a typical analytics setup, the collector script runs on yoursite.com but sends beacons to analytics.yoursite.com or a completely different domain. The browser's Same-Origin Policy blocks these cross-origin requests unless the server explicitly allows them via CORS (Cross-Origin Resource Sharing) headers.

The Headers You Need

Access-Control-Allow-Origin: https://yoursite.com
Access-Control-Allow-Methods: POST, OPTIONS
Access-Control-Allow-Headers: Content-Type
Access-Control-Max-Age: 86400

The Preflight Problem

When the browser sends a POST with Content-Type: application/json, it is a "non-simple" request. The browser automatically sends an OPTIONS preflight request first to ask the server for permission. Your endpoint must handle both:

Browser                                     Server
   │                                           │
   │── OPTIONS /collect ──────────────────────>│
   │   Origin: https://yoursite.com            │
   │   Access-Control-Request-Method: POST     │
   │   Access-Control-Request-Headers:         │
   │     Content-Type                          │
   │                                           │
   │<── 204 No Content ────────────────────────│
   │    Access-Control-Allow-Origin: ...       │
   │    Access-Control-Allow-Methods: ...      │
   │    Access-Control-Max-Age: 86400          │
   │                                           │
   │── POST /collect ─────────────────────────>│
   │   Content-Type: application/json          │
   │   Origin: https://yoursite.com            │
   │   { ... beacon data ... }                 │
   │                                           │
   │<── 204 No Content ────────────────────────│
   │    Access-Control-Allow-Origin: ...       │
   │                                           │

Never use Access-Control-Allow-Origin: * in production. A wildcard allows any website to send data to your endpoint. An attacker could flood your database with fake analytics data from their own page. Always whitelist the specific origins you expect.
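
Handled manually in Express, honoring the whitelist advice above (a sketch; the allowed-origins list is an assumption for your deployment):

// Manual CORS handling with an explicit origin whitelist (no wildcard).
const ALLOWED_ORIGINS = new Set([
  'https://yoursite.com',
  'https://www.yoursite.com',
]);

app.use('/collect', (req, res, next) => {
  const origin = req.headers.origin;
  if (origin && ALLOWED_ORIGINS.has(origin)) {
    res.set('Access-Control-Allow-Origin', origin); // echo the matched origin
    res.set('Access-Control-Allow-Methods', 'POST, OPTIONS');
    res.set('Access-Control-Allow-Headers', 'Content-Type');
    res.set('Access-Control-Max-Age', '86400');     // cache the preflight for a day
    res.set('Vary', 'Origin');                      // keep caches origin-aware
  }
  if (req.method === 'OPTIONS') return res.status(204).end(); // answer the preflight
  next();
});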

sendBeacon and preflights: If you send a beacon with Content-Type: text/plain instead of application/json, the browser treats it as a "simple" request and skips the preflight. This halves the number of requests but means you must parse the body manually on the server. Many production collectors use this trick for performance.
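
A sketch of that trick on both ends (the payload variable is illustrative; if you adopt this, the text parser replaces the express.json() middleware shown earlier):

// Client: a plain string body is sent as text/plain, which counts as
// a "simple" request and skips the preflight entirely.
navigator.sendBeacon('/collect', JSON.stringify(payload));

// Server: accept text/plain and parse the JSON manually.
app.use('/collect', express.text({ type: 'text/plain' }));
app.post('/collect', (req, res) => {
  let body;
  try {
    body = JSON.parse(req.body); // req.body is a raw string here
  } catch (err) {
    // log and discard; see Section 8
  }
  // ... continue the pipeline with body ...
  res.status(204).end();
});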

4. Validation & Sanitization

Analytics beacons arrive as user-supplied input. An attacker can craft any payload they want using curl, browser DevTools, or a script. If you blindly store this data and later display it in a dashboard, you have a stored XSS vulnerability.

The Validation Pipeline

Raw JSON Input
       │
       ▼
┌──────────────────────────────────┐
│ 1. Parse JSON                    │
│    - Reject if not valid JSON    │
│    - Reject if not an object     │
└──────────────┬───────────────────┘
               ▼
┌──────────────────────────────────┐
│ 2. Schema Check                  │
│    - Only allow known fields     │
│    - Reject unexpected keys      │
│    - Check required fields       │
└──────────────┬───────────────────┘
               ▼
┌──────────────────────────────────┐
│ 3. Type Validation               │
│    - url: string, max 2048       │
│    - timestamp: number (epoch)   │
│    - viewport_w: integer > 0     │
│    - event_type: enum whitelist  │
└──────────────┬───────────────────┘
               ▼
┌──────────────────────────────────┐
│ 4. Sanitization                  │
│    - Strip HTML tags from        │
│      all string fields           │
│    - Truncate long strings       │
│    - Normalize URLs              │
└──────────────┬───────────────────┘
               ▼
┌──────────────────────────────────┐
│ 5. Use Parameterized Queries     │
│    - NEVER concatenate strings   │
│      into SQL                    │
│    - Use prepared statements     │
└──────────────────────────────────┘

What to Validate

Field       | Type    | Validation Rule
------------|---------|-----------------------------------------------------
url         | string  | Must be a valid URL, max 2048 chars
referrer    | string  | Valid URL or empty string, max 2048 chars
event_type  | string  | Must be one of: pageview, click, error, performance
timestamp   | number  | Unix epoch in ms, must be within a reasonable range
viewport_w  | integer | Positive integer, max 10000
viewport_h  | integer | Positive integer, max 10000
user_agent  | string  | Max 512 chars, strip HTML

Defense in depth: Validation on the server is not optional, even if the collector validates on the client side. The client is untrusted territory — anyone can bypass your JavaScript and send raw HTTP requests directly to your endpoint.
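
One way to express those rules in Node.js. A sketch only: the field whitelist mirrors the table, a crude tag-stripping regex stands in for a real sanitizer, and user_agent is intentionally absent because the server reads it from the header (Section 5):

// Whitelist validator mirroring the rules in the table above.
const EVENT_TYPES = new Set(['pageview', 'click', 'error', 'performance']);
const ALLOWED_FIELDS = new Set([
  'url', 'referrer', 'event_type', 'timestamp', 'viewport_w', 'viewport_h',
]);

function validateBeacon(body) {
  if (typeof body !== 'object' || body === null || Array.isArray(body)) {
    throw new Error('payload is not an object');
  }
  for (const key of Object.keys(body)) {
    if (!ALLOWED_FIELDS.has(key)) throw new Error(`unexpected field: ${key}`);
  }
  const { url, referrer = '', event_type, timestamp, viewport_w, viewport_h } = body;

  if (typeof url !== 'string' || url.length > 2048) throw new Error('bad url');
  new URL(url); // throws if not a valid URL
  if (typeof referrer !== 'string' || referrer.length > 2048) throw new Error('bad referrer');
  if (referrer !== '') new URL(referrer); // empty string is allowed
  if (!EVENT_TYPES.has(event_type)) throw new Error('bad event_type');
  if (typeof timestamp !== 'number' ||
      Math.abs(Date.now() - timestamp) > 24 * 60 * 60 * 1000) { // "reasonable": within 24h
    throw new Error('bad timestamp');
  }
  for (const v of [viewport_w, viewport_h]) {
    if (!Number.isInteger(v) || v <= 0 || v > 10000) throw new Error('bad viewport');
  }

  // Step 4 (sanitization): strip anything tag-like from string fields.
  const strip = (s) => s.replace(/<[^>]*>/g, '');
  return { url: strip(url), referrer: strip(referrer), event_type, timestamp, viewport_w, viewport_h };
}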

5. Server-Side Enrichment

The client can tell you what page the user is on and what their viewport size is. But some data can only come from the server, and some data the server should override because the client cannot be trusted to provide it accurately.

What the Server Adds

Field            | Source                            | Why Server-Side
-----------------|-----------------------------------|--------------------------------------------------------------------
server_timestamp | Date.now() (Node) / time() (PHP)  | Client clocks can be wrong; server time is authoritative
client_ip        | Request IP / X-Forwarded-For      | Only visible to the server; clients cannot self-report IPs
geo_country      | IP geolocation lookup             | Derived from IP; placeholder until a GeoIP database is integrated
user_hash        | Hash of IP + User-Agent           | Privacy-preserving user identifier without cookies
user_agent       | User-Agent header                 | Server reads the header directly; client-reported value is ignored
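
A sketch of the enrichment step. The X-Forwarded-For handling assumes a trusted reverse proxy in front of the app, and computeUserHash is a hypothetical name for the hashing helper shown below:

// Adds the server-only fields from the table above.
function enrichBeacon(beacon, req) {
  // Behind a trusted proxy, the first X-Forwarded-For entry is the
  // original client; otherwise fall back to the socket address.
  const forwarded = req.headers['x-forwarded-for'];
  const clientIP = forwarded
    ? forwarded.split(',')[0].trim()
    : req.socket.remoteAddress;
  const userAgent = (req.headers['user-agent'] || '').slice(0, 512);

  return {
    ...beacon,
    server_timestamp: Date.now(),  // authoritative server clock
    client_ip: clientIP,
    geo_country: null,             // placeholder until a GeoIP database is integrated
    user_hash: computeUserHash(clientIP, userAgent), // see the hashing code below
    user_agent: userAgent,         // header value; any client-reported value is ignored
  };
}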

Privacy Considerations

Storing raw IP addresses raises privacy concerns (GDPR, CCPA). A common approach is to hash the IP with a daily rotating salt, producing a consistent identifier for sessionization without storing the actual IP:

// Node.js
const crypto = require('crypto');
const dailySalt = getDailySalt(); // rotates every 24 hours
const userHash = crypto
  .createHash('sha256')
  .update(clientIP + userAgent + dailySalt)
  .digest('hex')
  .substring(0, 16);

// PHP
$dailySalt = getDailySalt(); // rotates every 24 hours
$userHash = substr(
    hash('sha256', $clientIP . $userAgent . $dailySalt),
    0, 16
);
Why a daily rotating salt? It prevents long-term tracking while still allowing same-day session linking. After the salt rotates, the same user produces a different hash, making cross-day identification impossible without cookies.
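
One way to implement getDailySalt without storing any state: derive it from a server-side secret plus the current UTC date, so the value changes at midnight (a sketch; SALT_SECRET is an assumed environment variable, and crypto is the module already required above):

// Derives a salt that is stable within a UTC day and rotates at midnight.
function getDailySalt() {
  const day = new Date().toISOString().slice(0, 10); // e.g. "2024-06-01"
  return crypto.createHash('sha256')
    .update(process.env.SALT_SECRET + day)
    .digest('hex');
}

A tradeoff of the derived approach: if SALT_SECRET ever leaks, past salts can be recomputed and old hashes become linkable. Generating a random salt each day and discarding the previous one avoids that, at the cost of storing the current salt somewhere.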

6. Sessionization

A session groups related page views and events from a single user visit. Without sessions, your analytics data is just a flat list of unconnected events. Sessions let you answer questions like "how many pages does a typical user view?" and "where do users enter and exit?"

The 30-Minute Rule

The industry standard (used by Google Analytics, Adobe Analytics, and others) defines a session as ending after 30 minutes of inactivity. If a user views a page, leaves for 25 minutes, and comes back, it is the same session. If they leave for 35 minutes, a new session starts.

User Activity Timeline
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 Page A      Page B          Page C                      Page D    Page E
   │           │               │                           │         │
   ▼           ▼               ▼                           ▼         ▼
───●───────────●───────────────●───────────────────────────●─────────●───
   │   5min    │     10min     │          45min            │  3min   │
   │           │               │          (gap!)           │         │
   └───────────────────────────┘                           └─────────┘
      Session 1 (sid_abc123)                        Session 2 (sid_def456)
    3 pageviews, 15min duration                   2 pageviews, 3min duration

Session Linking Algorithm

  1. Compute user_hash from IP + User-Agent (see Enrichment above)
  2. Query for the most recent event from this user_hash
  3. If found and the last event was less than 30 minutes ago, reuse that session_id
  4. Otherwise, generate a new session_id (UUID v4 or similar)
  5. Store the session_id with the current event
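
A sketch of this algorithm against MySQL, assuming a mysql2/promise pool and an events table with user_hash, session_id, and server_timestamp columns:

const { randomUUID } = require('crypto');
const THIRTY_MINUTES = 30 * 60 * 1000;

async function sessionize(record, pool) {
  // Step 2: find this user's most recent event
  const [rows] = await pool.query(
    `SELECT session_id, server_timestamp FROM events
      WHERE user_hash = ? ORDER BY server_timestamp DESC LIMIT 1`,
    [record.user_hash]
  );
  // Steps 3-4: reuse the session if the gap is under 30 minutes
  if (rows.length > 0 &&
      record.server_timestamp - rows[0].server_timestamp < THIRTY_MINUTES) {
    record.session_id = rows[0].session_id;
  } else {
    record.session_id = randomUUID(); // new session (UUID v4)
  }
  return record; // step 5: session_id is stored with the event on INSERT
}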

First-Touch Attribution

The first event in a session carries special significance: it records the entry page (the landing URL) and the referrer, which together identify the traffic source for the entire session.

Sessions without cookies: Because we use a hashed IP + User-Agent identifier, our sessionization works without setting any cookies. This is privacy-friendly and avoids cookie consent requirements in many jurisdictions. The tradeoff is that users behind shared IPs (corporate networks, university WiFi) may be grouped together.

7. Batch vs Real-Time Ingestion

Once beacons are validated, enriched, and sessionized, they need to be inserted into the database. There are two fundamental strategies, each with distinct tradeoffs.

Real-Time (Insert Per Beacon)

Each incoming beacon triggers an immediate INSERT statement. Data is available in the database within milliseconds of the event occurring.

Advantage                         | Disadvantage
----------------------------------|---------------------------------------
Data available immediately        | One DB connection per beacon
Simple implementation             | Higher DB load under traffic spikes
No data loss from buffer crashes  | Slower individual inserts vs. bulk

Batch (Buffer and Bulk Insert)

Beacons are accumulated in an in-memory buffer (or a file queue) and flushed to the database periodically (e.g., every 5 seconds or every 100 records).

Advantage                          | Disadvantage
-----------------------------------|----------------------------------------
Far fewer DB connections           | Data is delayed by the flush interval
Bulk INSERT is much faster         | Buffer loss on server crash
Handles traffic spikes gracefully  | More complex implementation

Real-Time                       Batch
━━━━━━━━━━━━━━━━━               ━━━━━━━━━━━━━━━━━━━━━
Beacon 1 ──> INSERT             Beacon 1 ──┐
Beacon 2 ──> INSERT             Beacon 2 ──┤
Beacon 3 ──> INSERT             Beacon 3 ──┼──> BULK INSERT
Beacon 4 ──> INSERT             Beacon 4 ──┤    (one statement)
Beacon 5 ──> INSERT             Beacon 5 ──┘

5 statements, 5 connections     1 statement, 1 connection
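
For contrast, a sketch of the batch pattern: an in-memory buffer flushed on a timer or when it fills. The thresholds and the pool variable are illustrative, and the comments mark where a crash loses data:

// Batch ingestion sketch. Only viable in a long-running process
// (Node); per-request PHP would need an external queue instead.
const buffer = [];
const FLUSH_SIZE = 100;
const FLUSH_INTERVAL_MS = 5000;

function enqueue(record) {
  buffer.push(record);
  if (buffer.length >= FLUSH_SIZE) flush(); // size-triggered flush
}

async function flush() {
  if (buffer.length === 0) return;
  const batch = buffer.splice(0, buffer.length); // drained records are lost on a crash here
  const values = batch.map(r => [r.session_id, r.url, r.event_type, r.server_timestamp]);
  // One bulk INSERT instead of one statement per beacon
  await pool.query(
    'INSERT INTO events (session_id, url, event_type, server_timestamp) VALUES ?',
    [values]
  );
}

setInterval(flush, FLUSH_INTERVAL_MS); // time-triggered flush
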
For this project, use real-time insertion. At the traffic volumes you will encounter in a class project, individual inserts are perfectly adequate and far simpler to implement. Batch ingestion is a scaling optimization you would add when insert volume exceeds what a single connection can handle (typically thousands of beacons per second).

8. Error Handling

Your analytics endpoint will encounter errors: malformed JSON, unexpected field types, database connection failures, disk full conditions. The way you handle these errors matters for both reliability and security.

The Golden Rule: Always Return 204

Regardless of what happens internally, your /collect endpoint should return 204 No Content to the client. Never return error details.

Scenario                           | Internal Action                        | Client Response
-----------------------------------|----------------------------------------|----------------
Valid beacon, stored successfully  | INSERT into database                   | 204
Invalid JSON body                  | Log error, discard payload             | 204
Validation failure                 | Log violation, discard payload         | 204
Database connection down           | Log error, queue for retry or discard  | 204
Unknown/unexpected error           | Log stack trace, discard payload       | 204

Why Not Return Error Codes?

Returning a 400 with validation details would hand an attacker a feedback loop: they could probe the endpoint until their forged payloads pass every check, and the error codes would reveal which pipeline step rejected the data. The legitimate client gains nothing from an error either: navigator.sendBeacon() is fire-and-forget and never reads the response.

Log everything server-side. Just because you return 204 to the client does not mean you ignore errors. Log validation failures, database errors, and malformed payloads to a server-side log file or monitoring system. You need this data to debug pipeline issues.
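
The golden rule as code: a sketch wrapping the pipeline from Section 1 so that every outcome converges on 204 while the details go to the server log:

app.post('/collect', async (req, res) => {
  try {
    const beacon = validateBeacon(req.body);
    const record = enrichBeacon(beacon, req);
    await sessionize(record, pool);
    await storeBeacon(record);
  } catch (err) {
    // Log internally with enough context to debug; tell the client nothing.
    console.error(`[collect] ${err.message} ip=${req.ip} ua=${req.headers['user-agent']}`);
  }
  res.status(204).end(); // identical response on every path
});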

9. Dual Implementation: Node.js & PHP

The tutorials in this section provide both a Node.js (Express) and a PHP (PDO) implementation of the server-side processing pipeline. Both cover the same logic: receive, validate, enrich, sessionize, and store analytics beacons.

Aspect        | Node.js (Express)                   | PHP (PDO)
--------------|--------------------------------------|------------------------------------------------
HTTP handling | Express middleware                   | $_SERVER, php://input
JSON parsing  | express.json() middleware            | json_decode(file_get_contents('php://input'))
CORS          | cors package or manual headers       | header() calls
Database      | mysql2/promise                       | PDO with prepared statements
Hashing       | crypto.createHash('sha256')          | hash('sha256', ...)
Deployment    | Persistent process (PM2, systemd)    | Per-request behind Apache/Nginx

You do not need to build both. Choose the language that matches your stack — or build both to compare the execution model differences firsthand. The Node.js endpoint demonstrates the event-loop / long-running process model, while the PHP endpoint demonstrates the start-process-die model. Both are valid architectures for analytics ingestion.

Execution models matter here. In Node.js, your server process stays alive between requests, so you can maintain in-memory buffers, connection pools, and caches. In PHP, each request starts a fresh process (or reuses one via PHP-FPM), so state must come from external stores (database, Redis, files). See the Execution Models overview for a deeper comparison.

Tutorial Modules

01. Node.js Endpoint

Build an Express endpoint that receives, validates, enriches, and stores analytics beacons in MySQL.

02. PHP Endpoint

Build a PHP endpoint using PDO that mirrors the Node.js pipeline with the start-process-die execution model.

03. Validation Deep Dive

Harden your endpoint with schema validation, input sanitization, and defense against stored XSS.

04. Sessionization

Implement session linking with 30-minute timeout, user hashing, and first-touch attribution.
