Traditional observability tools assume you have long-running servers. They install agents, collect metrics over time, and track process state. Serverless functions break all of these assumptions.
What’s different about serverless
No persistent process — each invocation should be treated as isolated, and the platform creates and discards instances on demand. There’s no “server CPU” to track. The function starts, runs, and stops.
Cold starts are spikes, not failures — a 300ms cold start isn’t an error; it’s expected behavior. Your alerting needs to know the difference.
Correlation is harder — a user request might fan out across 10 functions. Each has its own logs, with no shared process ID to correlate them.
Billing is usage-based — you pay per invocation and for compute time, so you need to know which functions are expensive, not just which are slow.
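One way to let alerting tell cold starts from failures is to tag each invocation with whether it was the first request served by its instance, and to count only real errors toward the error alert. This is a minimal sketch. It assumes Workers-style module scope, where module-level state survives between requests handled by the same isolate. The field names (`level`, `duration`, `coldStart`) and the 1000ms threshold are hypothetical placeholders.

```javascript
// Module-level state survives across invocations served by the same
// instance, so it is false only for the first request after a cold start.
let instanceWarm = false;

// Call once at the top of each invocation; returns true on a cold start.
function markInvocation() {
  const coldStart = !instanceWarm;
  instanceWarm = true;
  return coldStart;
}

// Classify a log event for alerting. Hypothetical fields:
// level ('info' | 'error'), duration (ms), coldStart (boolean).
function classifyEvent(event, slowThresholdMs = 1000) {
  if (event.level === 'error') return 'error';
  if (event.duration > slowThresholdMs) {
    // A slow cold start is expected behavior, not an incident.
    return event.coldStart ? 'cold-start' : 'slow';
  }
  return 'ok';
}
```

With this split, alerts can fire on `'error'` and `'slow'`, while `'cold-start'` is only tracked as a trend.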
The three pillars (adapted for serverless)
Logs — the most important. Every invocation should emit a structured log with:
- Request ID (for correlation)
- Function name / service
- Duration
- Status (success/error)
- Custom business fields
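The fields above can be assembled into one structured event per invocation and emitted as a single JSON line. This is a sketch only. The function names and field names (`requestId`, `service`, `durationMs`, `status`) are illustrative choices, not a required schema.

```javascript
// Build one structured log event per invocation. Any extra properties are
// kept as custom business fields (e.g. orderId, plan).
function buildLogEvent({ requestId, service, durationMs, status, ...fields }) {
  if (!requestId) throw new Error('requestId is required for correlation');
  return {
    timestamp: new Date().toISOString(),
    requestId,
    service,
    durationMs,
    status, // 'success' | 'error'
    ...fields,
  };
}

// Emit as a single JSON line so log pipelines can parse it without regexes.
function emitLog(event) {
  console.log(JSON.stringify(event));
}
```

Requiring `requestId` at construction time makes a missing correlation key fail loudly in development. Without that check, the gap would only surface later, in a trace nobody can reconstruct.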
Traces — connecting related invocations. A traceId propagated through every downstream call (for example, in a request header) lets you reconstruct a user’s full journey.
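Propagation can be as simple as reading the trace ID from the incoming request and copying it onto every outgoing `fetch()`. The sketch below reuses the `x-trace-id` header name from the Cloudflare Workers example later in this post. That name is a convention, not a standard. The helper names are hypothetical.

```javascript
// Header carrying the trace ID between functions (a convention, not a standard).
const TRACE_HEADER = 'x-trace-id';

// Reuse the caller's trace ID if present, otherwise start a new trace.
// crypto.randomUUID() is the Web Crypto global available in Workers.
function getTraceId(request) {
  return request.headers.get(TRACE_HEADER) ?? crypto.randomUUID();
}

// Copy fetch() options and attach the trace ID so the downstream function
// logs under the same trace. Existing headers are preserved.
function withTrace(init = {}, traceId) {
  const headers = new Headers(init.headers);
  headers.set(TRACE_HEADER, traceId);
  return { ...init, headers };
}
```

A downstream call would then look like `fetch(url, withTrace({ method: 'POST' }, traceId))`, where `url` is whatever service you are calling.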
Metrics — derived from logs, not collected separately. Error rate, p99 latency, and invocation count can all be computed from log events.
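As an illustration of deriving metrics from logs, the sketch below computes invocation count, error rate, and p99 latency from events shaped like those in the Worker example (`{ level, duration }`). It uses the nearest-rank percentile definition. Production systems often use interpolated or approximate percentiles instead.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(values, p) {
  if (values.length === 0) return undefined;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Derive count, error rate and p99 latency from raw log events.
function metricsFromLogs(events) {
  const count = events.length;
  const errors = events.filter((e) => e.level === 'error').length;
  return {
    count,
    errorRate: count === 0 ? 0 : errors / count,
    p99: percentile(events.map((e) => e.duration), 99),
  };
}
```

Because nothing here depends on process state, the same function works whether the events come from one instance or from thousands.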
Practical setup for Cloudflare Workers
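// handleRequest (your application logic) and sendLog (ships one event to
// your log backend) are defined elsewhere in your Worker.
export default {
  async fetch(request, env, ctx) {
    // Reuse the caller's trace ID so every function in the chain logs under it.
    const traceId = request.headers.get('x-trace-id') ?? crypto.randomUUID();
    // In Workers, Date.now() only advances across I/O, so this measures
    // wall time spent awaiting work, not CPU time.
    const start = Date.now();
    try {
      const response = await handleRequest(request, env, traceId);
      // waitUntil lets the log finish sending without delaying the response.
      ctx.waitUntil(sendLog(env, {
        level: 'info',
        traceId,
        duration: Date.now() - start,
        status: response.status,
      }));
      return response;
    } catch (err) {
      ctx.waitUntil(sendLog(env, {
        level: 'error',
        traceId,
        duration: Date.now() - start,
        // Guard against non-Error throws, which have no .message.
        error: err instanceof Error ? err.message : String(err),
      }));
      throw err;
    }
  },
};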
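The example leaves `sendLog` undefined. A minimal version might POST each event as JSON to an HTTP log collector. `LOG_ENDPOINT` and `SERVICE_NAME` are assumed environment bindings, not part of any particular product's API.

```javascript
// Hypothetical sendLog: ships one event as JSON to an HTTP log collector.
// env.LOG_ENDPOINT and env.SERVICE_NAME are assumed bindings.
async function sendLog(env, event) {
  const body = JSON.stringify({
    timestamp: new Date().toISOString(),
    service: env.SERVICE_NAME ?? 'unknown',
    ...event,
  });
  try {
    const res = await fetch(env.LOG_ENDPOINT, {
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body,
    });
    if (!res.ok) console.error(`log shipping failed: HTTP ${res.status}`);
  } catch (err) {
    // A logging failure should never turn into a request failure.
    console.error('log shipping failed:', err);
  }
}
```

Swallowing errors inside `sendLog` matters here because it runs under `ctx.waitUntil`. The user already has their response, so there is nobody to report a logging failure to except the platform's own console.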
What to measure
Focus on these four signals, the RED method (Rate, Errors, Duration) extended with Cost for serverless:
- Rate — invocations per minute
- Errors — error rate and error messages
- Duration — p50, p95, p99 latency
- Cost — which functions consume the most CPU time
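The Rate, Errors, and Cost signals can be rolled up per function from the same log events, with the result ranked by total CPU time to surface the most expensive functions. This is a sketch. It assumes each event carries `service`, `level`, and a hypothetical `cpuMs` field populated from however your platform exposes CPU time. The Worker example above does not log CPU time.

```javascript
// Roll up log events per function over a window of `windowMinutes`:
// invocations per minute, error rate, and total CPU time, most expensive first.
function summarizeByFunction(events, windowMinutes) {
  const byService = new Map();
  for (const e of events) {
    const s = byService.get(e.service) ?? { invocations: 0, errors: 0, cpuMs: 0 };
    s.invocations += 1;
    if (e.level === 'error') s.errors += 1;
    s.cpuMs += e.cpuMs ?? 0; // hypothetical field; 0 if absent
    byService.set(e.service, s);
  }
  return [...byService.entries()]
    .map(([service, s]) => ({
      service,
      ratePerMinute: s.invocations / windowMinutes,
      errorRate: s.errors / s.invocations,
      totalCpuMs: s.cpuMs,
    }))
    .sort((a, b) => b.totalCpuMs - a.totalCpuMs);
}
```

Ranking by total rather than average CPU time is deliberate. A cheap function invoked millions of times can cost more than a slow one invoked rarely.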
Tooling that understands serverless
Most APM tools were designed before serverless existed. They work, but you’ll fight their assumptions constantly (agents, process metrics, persistent connections).
Tools built for serverless — like ScryWatch — store logs from the invocation boundary, correlate by traceId, and compute metrics from log events rather than process state. This matches how serverless actually works.
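Correlating by traceId boils down to grouping events that share an ID and ordering them in time. The sketch below is a generic illustration, not ScryWatch's implementation. It assumes each event carries a `traceId` and an ISO-8601 `timestamp`.

```javascript
// Group log events by traceId and sort each group by timestamp, which
// reconstructs the path a single request took across functions.
function groupByTrace(events) {
  const traces = new Map();
  for (const e of events) {
    if (!traces.has(e.traceId)) traces.set(e.traceId, []);
    traces.get(e.traceId).push(e);
  }
  for (const list of traces.values()) {
    list.sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp));
  }
  return traces;
}
```

Ordering by timestamp assumes the functions' clocks are close enough to agree. When strict ordering matters, a parent/child span ID is more reliable than wall-clock time.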