Cloudflare · September 26, 2025

How Cloudflare Cut Cold Starts 10x: From TLS Pre-Warming to Consistent Hash Sharding

14 min read · Advanced · Serverless · System Design · Distributed Systems · Performance

Cloudflare Workers started with 5ms cold starts that were hidden behind TLS handshakes. As Workers grew to support full applications (10 MB scripts, 400ms startup budgets), cold starts outgrew TLS — and the original trick stopped working. This post covers both generations of their solution: the TLS SNI pre-warming trick and the consistent hash ring sharding system that ultimately cut eviction rates 10x and pushed warm request rates to 99.99%.

TL;DR

  • Generation 1: Cloudflare pre-warms Workers during the TLS handshake by reading the SNI hostname from the ClientHello — hiding the cold start behind latency the client already has to wait for.
  • Generation 2: As Workers grew larger (scripts up to 10 MB, startup CPU up to 400ms), cold starts outgrew TLS handshakes — so Cloudflare instead routes requests to servers that already have the Worker loaded.
  • A consistent hash ring maps Worker script IDs to "shard servers." Requests are forwarded there (<1ms overhead) rather than cold-starting a new instance on whichever server received the request.
  • When a shard server is overloaded, it uses Cap'n Proto's distributed capability model to return the client's own lazy Worker reference — short-circuiting the proxy loop without an extra round trip.
  • Sharding just 4% of traffic reduced global eviction rates by 10x, pushing warm request rates from 99.9% to 99.99% — thanks to power law distributions where a small number of high-traffic Workers dominate total request volume.

Why this matters for interviews

Serverless cold starts, consistent hashing, and load shedding are all recurring system design topics. This post gives you a real production example of consistent hashing applied to something other than caches, plus a concrete case study in how to reason about optimistic vs. pessimistic request handling under overload.

Breakdown

1. What's a cold start and why does it hurt?

A cold start is everything that must happen before a serverless function can serve its first request: fetching the script from storage, compiling the source code, executing the top-level module (running initialization logic), and finally invoking the function to handle the incoming request. For Cloudflare Workers — which run inside V8 isolates rather than full containers — this was originally measured in single-digit milliseconds. Containers on other platforms can take full seconds to spin up a new process. V8 isolates share a single process, skipping OS-level overhead, so startup is dramatically cheaper. But even 5ms adds up when you have millions of low-traffic Workers that are frequently evicted and re-instantiated.
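The phases above can be captured in a tiny timing model. This is an illustrative sketch only: the phase durations are hypothetical round numbers chosen to sum to the ~5ms figure from the article, not Cloudflare's actual measurements.

```python
# Hypothetical cold-start phases for an isolate-based runtime.
# Durations are illustrative only, not Cloudflare's real numbers.
COLD_START_PHASES_MS = {
    "fetch_script": 1.0,    # pull source from storage
    "compile": 2.0,         # V8 parse + compile
    "run_top_level": 1.5,   # execute module initialization logic
}

def request_latency_ms(instance_warm: bool, handler_ms: float = 0.5) -> float:
    """Total latency: cold requests pay every phase, warm ones only the handler."""
    cold_penalty = 0.0 if instance_warm else sum(COLD_START_PHASES_MS.values())
    return cold_penalty + handler_ms

print(request_latency_ms(instance_warm=False))  # cold: 5.0
print(request_latency_ms(instance_warm=True))   # warm: 0.5
```

The point of the model is the asymmetry: every phase except the handler itself is a one-time cost that only warm routing or pre-warming can hide.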

Interview angle: When asked about serverless tradeoffs, cold starts are the canonical downside. Know the spectrum: containers (seconds), VMs (minutes), isolates (milliseconds). The follow-up is always "how do you mitigate them?" — which leads directly into the rest of this article.

2. Generation 1: Pre-warming during the TLS handshake

The first technique exploited a timing window that already exists in every HTTPS request. Before a client can send an HTTP request, it must complete a TLS handshake with the server. The very first message the client sends — the ClientHello — includes the SNI (Server Name Indication): the hostname the client is connecting to. For a Worker, that hostname is enough information to identify which script needs to run. Cloudflare's insight: the time the server spends completing the handshake (certificate exchange, key agreement) is time the client is already waiting for. If the cold start finishes before the handshake does, the request sees zero latency from it. In 2020, Worker cold starts were ~5ms and average network round-trip times exceeded that — so reading the SNI from the ClientHello and immediately starting to warm the Worker in the background made cold starts effectively invisible.
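The overlap can be sketched with two concurrent tasks: one for the remaining handshake, one for the Worker warm-up started the moment the SNI is known. Function names and durations here are hypothetical; the real implementation lives inside Cloudflare's TLS termination path.

```python
import asyncio

# Sketch of the Gen-1 trick: start the Worker cold start as soon as the
# ClientHello reveals the SNI hostname, concurrently with the rest of the
# TLS handshake. All names and sleep durations are illustrative.

async def finish_tls_handshake() -> None:
    await asyncio.sleep(0.010)  # certificate exchange + key agreement (~1 RTT)

async def warm_worker(hostname: str) -> str:
    await asyncio.sleep(0.005)  # fetch + compile + run top-level (~5ms in 2020)
    return f"warm:{hostname}"

async def handle_connection(sni_hostname: str) -> str:
    # Kick off the cold start the moment the ClientHello arrives...
    warmup = asyncio.create_task(warm_worker(sni_hostname))
    # ...and let it race the handshake the client must wait for anyway.
    await finish_tls_handshake()
    # If the cold start beat the handshake, this await returns immediately.
    return await warmup

print(asyncio.run(handle_connection("example.workers.dev")))
```

With these numbers the warm-up finishes well inside the handshake window, which is exactly the regime where the trick makes cold starts invisible; flip the durations and the trick only partially helps, which is the failure mode the next section covers.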

Timeline diagram showing Client and Cloudflare Server tracks. ClientHello triggers background Worker pre-warm; the Worker is warm before the handshake completes, so the request sees zero cold start delay.
Worker pre-warming starts the moment the ClientHello arrives — before the handshake even completes. By the time the HTTP request lands, the Worker is already hot.

Interview angle: "Work you do during unavoidable waiting time is free" is a powerful optimization pattern. TLS handshake pre-warming is one example; speculative execution in CPUs is another. Interview question: what are the failure modes? Answer: if cold start > handshake duration, you've helped but not eliminated the delay. If the Worker changes between ClientHello and request arrival, you've wasted work.

3. Why the trick stopped working: Workers grew up

The TLS pre-warming technique depended on one assumption: cold starts finish faster than TLS handshakes. That assumption eroded over time as Cloudflare relaxed platform limits to support real applications. Compressed script size limits went from 1 MB to 10 MB for paid users. Startup CPU time limits went from 200ms to 400ms. Users took full advantage — deploying complex bundled applications that previously would have lived on origin servers. Cold starts for these Workers now exceeded TLS handshake durations, especially with TLS 1.3 (which only needs a single round trip, making the pre-warming window even shorter). The result: any optimization to cold start speed itself was a losing battle. Instead of making cold starts faster, Cloudflare needed to make them rarer.

Interview angle: This is a classic "assumptions drift" problem. An optimization that works at one scale breaks when the system's parameters change. Interview signal: when an interviewer describes a technique that "used to work," probe what assumption changed. Here it was: cold start duration < handshake duration. The corrective insight — reduce frequency instead of duration — is the kind of lateral thinking that distinguishes strong candidates.

4. The core insight: route to warm Workers instead of starting new ones

Consider a Worker that receives one request per minute, spread evenly across 300 servers in a data center. Each server receives that Worker's traffic roughly once every five hours. Workers are evicted after inactivity to free memory for other Workers — so with five-hour gaps between requests, the Worker is almost always cold. Cold start rate approaches 100%. Now imagine routing all of that traffic to a single server. The Worker receives one request per minute on that one server — a short enough interval to stay warm indefinitely. Memory usage drops by 99.7% (one instance instead of 300). The forwarding cost? Less than one millisecond for an in-datacenter proxy hop. That's always cheaper than a cold start. Cloudflare calls this "Worker sharding" — the idea that each data center now has one logical pool of Workers, sharded across servers, rather than each server maintaining an independent set.
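The arithmetic in the example above checks out as a quick back-of-envelope calculation:

```python
# Back-of-envelope check of the numbers in the paragraph above:
# one request per minute, spread across 300 servers vs. routed to one.
servers = 300
requests_per_minute = 1

# Spread evenly: each server sees the Worker once every `servers` minutes.
gap_hours = servers / requests_per_minute / 60
print(gap_hours)                         # 5.0 hours between requests per server

# Routed to a single shard server: one instance instead of 300.
memory_savings_pct = round((1 - 1 / servers) * 100, 1)
print(memory_savings_pct)                # 99.7% less memory
```

Five-hour gaps guarantee eviction; one-minute gaps keep the instance warm indefinitely. The same traffic, just concentrated.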

Interview angle: The key numbers: <1ms proxy overhead vs. potentially hundreds of milliseconds for a cold start. Coalescing traffic to warm instances is almost always the right call. Secondary benefit: memory efficiency creates a virtuous cycle — fewer evictions means other Workers also stay warm longer. In interviews, model this as a cache hit rate problem: the goal is to minimize cold start (cache miss) rate, and routing is your lever.

5. Consistent hash ring: deterministically finding the shard server

To route requests to the server that already has a Worker loaded, every server needs to agree on which server "owns" a given Worker. Cloudflare uses a consistent hash ring. Both Worker script IDs and server addresses are hashed onto a single number line whose ends wrap around (a ring). To find a Worker's home server, hash its script ID, find that position on the ring, then take the next server address clockwise — that's the shard server. The elegance is in how it handles servers joining and leaving. With a naive hash table (script_id → server_index), adding or removing any server rehashes every Worker, causing a global cold start wave. With a consistent hash ring, only the Workers that were "owned" by the changed server need to be re-homed — all others are unaffected. In practice, a server leaving or crashing only displaces a fraction of Workers.
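A minimal consistent hash ring fits in a few lines. This is a sketch of the general technique, not Cloudflare's implementation: no virtual nodes, and SHA-256 stands in for whatever hash they actually use.

```python
import hashlib
from bisect import bisect_right

def ring_position(key: str) -> int:
    """Hash a key onto the ring: a 0..2^32-1 number line whose ends wrap."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")

class HashRing:
    """Minimal consistent hash ring (no virtual nodes, for clarity)."""
    def __init__(self, servers):
        self.ring = sorted((ring_position(s), s) for s in servers)
        self.positions = [pos for pos, _ in self.ring]

    def shard_server(self, script_id: str) -> str:
        """Home server = next server clockwise from the script's position."""
        i = bisect_right(self.positions, ring_position(script_id)) % len(self.ring)
        return self.ring[i][1]

servers = [f"server-{n}" for n in range(4)]
workers = [f"worker-{n}" for n in range(1000)]
before = {w: HashRing(servers).shard_server(w) for w in workers}

# Remove one server: only its Workers re-home; everyone else keeps their owner.
after_ring = HashRing([s for s in servers if s != "server-0"])
moved = sum(1 for w in workers if before[w] != after_ring.shard_server(w))
print(moved)  # only server-0's share of the 1000 Workers moves
```

Contrast with `hash(script_id) % num_servers`: changing `num_servers` from 4 to 3 would remap roughly three quarters of all Workers, triggering the global cold start wave the ring avoids.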

Two consistent hash rings side by side. Left: four servers and several Worker script IDs mapped onto the ring, each Worker pointing clockwise to its home server. Right: one server removed — only the Workers that were adjacent to it get re-homed to the next server; all others are unchanged.
Left: each Worker is owned by the next server clockwise on the ring. Right: when Server B leaves, only its Workers re-home to Server C — no global rehash.

Interview angle: Consistent hashing is the most commonly tested distributed systems concept in system design interviews. Most candidates apply it to caches (CDN, Memcached). This is a rarer but equally valid application to compute routing. Know the two failure modes: server churn causes re-homing (acceptable), and hash skew can make one server own too much (fixed with virtual nodes — worth mentioning).

6. Handling overload: optimistic sharding and graceful load shedding

Workers are live compute, not static files. A single high-traffic Worker may need dozens of instances to handle its load — so Cloudflare can never guarantee a Worker lives on exactly one server. The shard server must be able to refuse requests when overloaded, and the shard client (the server that received the request and forwarded it) must handle refusals gracefully. Cloudflare chose optimistic sending: just forward the request without asking permission first. The alternative — a round-trip "may I?" before sending — costs one extra RTT of latency, and since the vast majority of sharded requests succeed, that cost is almost always wasted. When a shard server does need to refuse, it uses Cap'n Proto's capability model: instead of sending a "go away" error, it returns the shard client's own lazy Worker reference back to the client. The shard client's RPC layer recognizes the returned capability as local, stops sending bytes to the shard server, and serves the request itself — no trombone effect, no extra RTT.
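The control flow can be sketched without the RPC machinery. In the real system the refusal is a Cap'n Proto capability pointing back at the client's own lazy Worker reference; here it is modeled as a sentinel value with the same meaning. All class and method names are hypothetical.

```python
# Sketch of optimistic sharding with graceful refusal. The real system uses
# Cap'n Proto capabilities; the shard server's refusal is modeled here as a
# sentinel telling the client to fall back to its own local Worker instance.

USE_LOCAL = object()  # stands in for "your own lazy Worker reference"

class ShardServer:
    def __init__(self, overloaded: bool):
        self.overloaded = overloaded

    def handle(self, request: str):
        if self.overloaded:
            return USE_LOCAL          # refuse by returning the caller's capability
        return f"warm-response:{request}"

class ShardClient:
    def __init__(self, shard_server: ShardServer):
        self.shard_server = shard_server

    def run_locally(self, request: str) -> str:
        return f"local-cold-start-response:{request}"  # pay the cold start here

    def handle(self, request: str) -> str:
        # Optimistic: forward without asking permission first (no extra RTT).
        result = self.shard_server.handle(request)
        if result is USE_LOCAL:
            # Short-circuit: serve the request ourselves, keeping the
            # overloaded shard server out of the response path entirely.
            return self.run_locally(request)
        return result

print(ShardClient(ShardServer(overloaded=False)).handle("req-1"))
print(ShardClient(ShardServer(overloaded=True)).handle("req-2"))
```

What the sketch cannot show is the part Cap'n Proto handles for free: the client's RPC layer recognizes the returned capability as one of its own local objects and stops streaming request bytes to the shard server automatically, with no application-level error handling.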

Two side-by-side request flow diagrams. Happy path: request arrives at shard client, forwarded to shard server which has the Worker warm, response returned. Overload path: shard client forwards to shard server, shard server returns the client's own lazy capability, shard client short-circuits and serves the request locally via a cold start.
Happy path: the shard server has the Worker warm and handles the request. Overload path: the shard server returns the client's own lazy capability — the RPC layer recognizes it as local and short-circuits, keeping the shard server entirely out of the response path.

Interview angle: "Optimistic vs. pessimistic" is a fundamental distributed systems tradeoff. Optimistic is faster in the common case but requires a well-designed rollback/refusal path. The Cap'n Proto trick here is particularly elegant: the refusal mechanism is zero-cost for the common success case, and the fallback is handled automatically by the RPC layer without any application-level logic.

7. Workers invoking Workers: propagating context across machines

Many Cloudflare products involve chains of Worker invocations. Service Bindings let one Worker call another directly. Workers for Platforms runs a dispatch Worker that selects and invokes a user Worker, which may call an outbound Worker. Tail Workers collect traces. All of these can be sharded to different servers. The challenge: each invocation carries a context stack — ownership overrides, trust levels, resource limits, tail Worker configs, outbound Worker configs, feature flags. When everything ran on a single thread, this was manageable. For sharding to work, the context must travel with the request across machines. Cloudflare's solution: serialize the context stack into a Cap'n Proto message and send it to the shard server. Cap'n Proto capabilities (live object references) are particularly useful here — a reportTraces() callback on the dispatch Worker's home server can be embedded in the serialized context, and any shard server that ends up with a piece of the request chain can call it directly, without knowing where "home" is.
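The serialize-at-the-boundary pattern can be sketched with plain JSON; Cloudflare actually uses Cap'n Proto, and the field names below are illustrative, not their schema. The key detail is the `trace_sink` entry: a stable identifier for a callback on another machine, not an in-memory pointer.

```python
import json

# Sketch of distributed context propagation: serialize the invocation context
# at the machine boundary; refer to callbacks by stable identifiers so any
# shard server can reach them without knowing where "home" is.

def serialize_context(context_stack: list) -> bytes:
    return json.dumps(context_stack).encode()

def deserialize_context(payload: bytes) -> list:
    return json.loads(payload.decode())

context_stack = [
    {"worker": "dispatch-worker", "trust_level": "platform",
     "trace_sink": "cap://dispatch-home/reportTraces"},  # stable ID, not a pointer
    {"worker": "user-worker", "trust_level": "sandboxed",
     "limits": {"cpu_ms": 400, "script_mb": 10}},
]

wire = serialize_context(context_stack)   # travels with the forwarded request
restored = deserialize_context(wire)      # shard server rebuilds the stack
print(restored == context_stack)          # round-trips intact
```

Cap'n Proto capabilities improve on the string identifier used here: an embedded capability is a live, callable reference, so the shard server can invoke `reportTraces()` directly rather than resolving an address first.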

Interview angle: Distributed context propagation is the operational name for this problem — it's the same challenge behind distributed tracing (OpenTelemetry, Jaeger), request-scoped auth tokens, and feature flag propagation across microservices. The pattern is always: serialize the context at the boundary, deserialize at the destination, use stable identifiers (not in-memory pointers) to refer to things that need callbacks.

8. Results: 10x fewer evictions, four nines of warm requests

After full rollout, only about 4% of enterprise traffic was sharded; the other 96% of requests went to Workers that already had enough concurrent traffic to keep instances warm across many servers. Yet that 4% drove a 10x reduction in global Worker eviction rates. The explanation is power law distributions: internet traffic is heavily skewed. A small number of high-traffic Workers account for the vast majority of requests. They were never a cold start problem. The long tail — a huge number of low-traffic Workers that receive rare, sporadic requests — caused nearly all the evictions. Sharding coalesces that long tail onto single instances, keeping them warm. The warm request rate for enterprise traffic improved from 99.9% (three nines) to 99.99% (four nines). Equivalently, the cold start rate dropped from 0.1% to 0.01% of requests — a 10x decrease consistent with the eviction rate improvement.
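The "three nines to four nines" framing undersells the improvement unless you look at the miss rate, which is the quantity that actually shrinks 10x:

```python
# Checking the arithmetic behind "three nines to four nines is a 10x win":
# the headline number is the warm rate, but the quantity that improves 10x
# is the cold (miss) rate.
warm_before, warm_after = 0.999, 0.9999

cold_before = 1 - warm_before   # 0.1% of requests hit a cold start
cold_after = 1 - warm_after     # 0.01% of requests
print(round(cold_before / cold_after))  # 10x fewer cold starts
```

This is the same framing trick as cache hit rates: going from 99% to 99.9% hits sounds incremental, but it cuts misses, and therefore miss-induced latency, by 10x.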

Interview angle: Power law distributions appear constantly in systems: web traffic, database query frequencies, cache key access patterns, error rates. The key interview insight here is that optimizing the 4% of traffic that follows the power law tail can have disproportionate system-wide impact. When asked "would this optimization be worth it?", the answer often depends on the distribution of your workload — uniform distributions rarely give this kind of leverage.

Key Concepts

Cold Start

The latency incurred the first time a serverless function is invoked after a period of inactivity. Includes fetching script code, compiling it, executing initialization logic, and handling the first request. Subsequent requests to the same loaded instance are "warm" and skip all of this.

Analogy: Like starting a car that's been sitting in a cold garage vs. one with the engine already running. The warm car moves the instant you shift into drive.

V8 Isolate

A lightweight JavaScript execution context inside the V8 engine. Multiple isolates share a single OS process and V8 heap, so creating a new one skips the process-spawning overhead of containers. This is what gives Cloudflare Workers millisecond-scale cold starts instead of second-scale.

Analogy: Like opening a new browser tab vs. launching a new browser. New tab: instant. New browser: takes a few seconds.

TLS SNI (Server Name Indication)

An extension to the TLS protocol where the client includes the target hostname in its very first message (the ClientHello). Allows servers to serve multiple certificates from one IP. Cloudflare exploits this to identify which Worker to pre-warm before the handshake completes.

Consistent Hash Ring

A data structure that maps both keys (e.g., Worker script IDs) and nodes (e.g., server addresses) onto a circular number line. Each key is "owned" by the next node clockwise. When nodes are added or removed, only keys adjacent to the change are affected — unlike a naive hash table where all keys must be remapped.

Analogy: Like assigning postal routes on a circular road. If a delivery driver drops out, only the letters just before their stretch of road need to be reassigned to the next driver. Everyone else keeps their existing route.

Shard Client / Shard Server

In Cloudflare's sharding model, the shard client is the server that receives the incoming request. The shard server is the server the hash ring designates as the Worker's home — the one likely to already have it loaded. The shard client forwards the request to the shard server to avoid a cold start.

Load Shedding

Deliberately refusing or rerouting requests when a server is overloaded, to prevent cascading failure. The goal is graceful degradation — shedding excess load before it causes errors for all requests, not just the excess ones.

Cap'n Proto RPC

A serialization and remote procedure call framework that supports distributed object capabilities — references to live objects that can be passed across machines. Cloudflare uses it for all cross-instance Worker communication, including the load shedding refusal trick and context stack serialization.

Power Law Distribution

A statistical pattern where a small number of items account for a disproportionately large share of events. Web traffic follows a power law: a handful of Workers handle the majority of requests, while a long tail of Workers each handle very little traffic. This long tail is where cold starts concentrate.

Analogy: "The rich get richer." A few words like "the" and "and" appear thousands of times in any text; most words appear once or twice.
