Infrastructure | Engineering Log | March 11, 2020 | 8 min read

Cold Starts and Kubernetes: Managing Latency in Burst Traffic

A model-backed API dropped requests during a burst even while Kubernetes was scaling. The fix was making model readiness explicit, blocking traffic from cold pods, adding ingress rate limits, and treating autoscaling as delayed capacity instead of instant rescue.

Tags: Kubernetes, Autoscaling, Cold Starts, FastAPI, Nginx, Model Serving, Readiness Probes, Backend Architecture

How’s everyone doing?

I hope everyone’s keeping safe. Strange times right now. Work is still moving, but everything feels heavier with the COVID-19 pandemic unfolding in the background.

Anyway, let’s get started.

Today, I’m writing about a backend failure that looked like an autoscaling problem at first.

The API dropped a large batch of requests in a few minutes. That was the issue.

I had recently moved a model-backed pipeline into Kubernetes. The goal was resource control: keep enough pods running for normal traffic, then let the cluster add capacity when demand spikes.

The mental model was simple. Traffic increases. CPU increases. Kubernetes adds pods. More workers handle the load.

The actual system had a rougher shape:

client batch job
→ API gateway
→ Kubernetes ingress
→ active model pods
→ autoscaler notices CPU spike
→ new pods start
→ model loads into memory
→ pods become ready

Figure: Kubernetes model-serving flow. Client sends payloads through ingress to active pods. The HPA observes CPU load and starts new pods, which must pull the image and load the model before serving traffic.

The burst arrived faster than useful capacity

The first failure was in the traffic pattern.

A downstream batch job dumped a large backlog of text payloads into the API at once. No pacing. No streaming. No backpressure. Just a wall of requests.

The active pods started taking the hit. CPU climbed. The Horizontal Pod Autoscaler reacted and requested more capacity.

That part worked.

The problem was what happened next.

The new pods did not become useful immediately.

A normal web container can start fast. A model-serving container is heavier. It has to pull the image, start the app, load the model, warm the runtime, and only then accept real traffic.

That delay was the cold start penalty.

Figure: Cold start timeline. Pod scheduled, image pulled, app boots, model loads into RAM, readiness passes, traffic starts. The gap that matters is between the pod existing and the pod being able to serve inference.

The system needed more workers right away, but the new workers needed time before they could actually run inference. During that window, requests kept arriving. Some connections waited. Some timed out. The load balancer started returning gateway errors.

The cluster was technically scaling, but the product was still failing.

That is the annoying part of autoscaling model workloads. Scheduling more pods does not mean the system has more useful inference capacity yet.

There is a difference between a pod being created and a pod being ready.

That difference has to be made explicit.
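To put rough numbers on that gap, here is a minimal sketch of the arithmetic. Every figure in it (stage durations, request rates, pod counts) is hypothetical and chosen only to make the calculation concrete; none of it comes from the incident itself.

```python
# All numbers below are hypothetical, chosen only to illustrate the arithmetic.
COLD_START_STAGES_S = {
    "image_pull": 40,
    "app_boot": 5,
    "model_load": 35,
    "warm_up": 10,
}


def cold_start_seconds(stages):
    """Total time from 'pod scheduled' to 'pod can serve inference'."""
    return sum(stages.values())


def backlog_during_cold_start(arrival_rps, per_pod_rps, warm_pods, cold_start_s):
    """Requests that exceed warm capacity while new pods are still starting."""
    capacity_rps = per_pod_rps * warm_pods
    shortfall_rps = max(0, arrival_rps - capacity_rps)
    return shortfall_rps * cold_start_s


cold_start_s = cold_start_seconds(COLD_START_STAGES_S)
backlog = backlog_during_cold_start(
    arrival_rps=200, per_pod_rps=20, warm_pods=3, cold_start_s=cold_start_s
)
print(cold_start_s, backlog)  # 90 12600
```

With these made-up inputs, three warm pods absorb 60 requests per second, so a 200 requests per second burst leaves 140 per second unserved for the full 90 seconds. Those 12,600 requests have to wait, time out, or be rejected no matter how quickly the autoscaler reacts.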

Readiness had to mean model-loaded

My first fix was around readiness.

I added proper startup and readiness checks inside the FastAPI service. Kubernetes needed a reliable way to ask:

“Can this pod actually serve a request now?”

The answer should only be yes after the model is fully loaded and available in memory.

The app needed routes like this:

/health/live
/health/ready

Figure: FastAPI liveness and readiness checks. /health/live confirms the app process is running. /health/ready returns 200 only after the model is loaded into global state and ready for inference.
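A minimal sketch of the logic behind those two routes follows. It is deliberately framework-agnostic and uses only the standard library, so it is not the FastAPI code from the service. The `ModelHolder` class, the `loader` callable, and the `(status, body)` return shape are all hypothetical; in a FastAPI app, `live` and `ready` would be the bodies of the two route handlers.

```python
import threading


class ModelHolder:
    """Holds the model and records when loading has finished."""

    def __init__(self):
        self.model = None
        self._loaded = threading.Event()

    def load_in_background(self, loader):
        # `loader` is a hypothetical callable that returns the loaded model.
        def _run():
            self.model = loader()
            self._loaded.set()  # flip to ready only after the load completes

        thread = threading.Thread(target=_run, daemon=True)
        thread.start()
        return thread

    def is_ready(self):
        return self._loaded.is_set()


def live():
    """GET /health/live: the process is up, nothing more."""
    return 200, {"status": "alive"}


def ready(holder):
    """GET /health/ready: 200 only once the model is in memory."""
    if holder.is_ready():
        return 200, {"status": "ready"}
    return 503, {"status": "loading"}


holder = ModelHolder()
print(ready(holder))  # (503, {'status': 'loading'}) before the load
holder.load_in_background(lambda: "fake-model").join()
print(ready(holder))  # (200, {'status': 'ready'}) after the load
```

One property worth noting: if the loader raises, the event is never set, so the readiness check keeps answering 503 and the pod never receives traffic it cannot handle.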

The distinction mattered.

A liveness check tells Kubernetes the process is alive.

A readiness check tells Kubernetes the pod can receive traffic.

Those are different for model-serving systems.

Before this, a pod could look alive while still loading the model. That made the ingress layer too optimistic. It could send work to a container that was running but not ready.

After the readiness check, the pod had to prove the model was loaded before receiving payloads.

pod starts
→ app boots
→ model loading begins
→ readiness stays false
→ model loaded
→ readiness returns 200
→ ingress can route traffic

Figure: Readiness-gated routing. Ingress holds traffic away from a new pod until /health/ready confirms the model is loaded.
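The routing side of that gate can be modeled in a few lines. In a real cluster this filtering is done by Kubernetes, which leaves pods that fail their readiness probe out of the Service endpoints, so nobody writes this by hand. The `Pod` and `ReadyOnlyRouter` names below are hypothetical and exist only to make the behavior visible.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Pod:
    name: str
    ready: bool = False


class ReadyOnlyRouter:
    """Round-robin over pods whose readiness check currently passes."""

    def __init__(self, pods):
        self.pods = pods
        self._counter = count()

    def pick(self):
        candidates = [p for p in self.pods if p.ready]
        if not candidates:
            return None  # no warm worker: reject rather than guess
        return candidates[next(self._counter) % len(candidates)]


pods = [Pod("warm-1", ready=True), Pod("cold-1", ready=False)]
router = ReadyOnlyRouter(pods)

# While cold-1 is still loading, every request lands on warm-1.
print([router.pick().name for _ in range(3)])  # ['warm-1', 'warm-1', 'warm-1']

# Once the model is loaded, cold-1 joins the rotation.
pods[1].ready = True
print(sorted({router.pick().name for _ in range(2)}))  # ['cold-1', 'warm-1']
```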

That fixed the routing problem, but it did not fix the burst problem by itself.

The ingress became the shock absorber

The second fix was the traffic funnel.

If a client dumps too much work at once, Kubernetes cannot bend time. New pods still need time to start. Model weights still need to load. Disk and memory still have physical limits.

So the gateway needed to absorb some of the shock.

I added stricter rate limiting at the ingress layer. The goal was simple: stop one bad batch job from flooding the model workers faster than the cluster could react.

incoming burst
→ ingress rate limit
→ controlled request flow
→ active pods continue serving
→ new pods warm up
→ traffic expands gradually

Figure: Ingress as shock absorber. An uncontrolled burst enters the Nginx ingress, and rate limiting smooths the request flow while new model pods warm up behind readiness checks.
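To show what "smoothing" means in practice, here is a small rate-limiter sketch. Nginx documents its request limiting as a leaky-bucket method; the token bucket below is a related but different algorithm, used here only because it is short. The class name, the rate, and the burst size are assumptions for illustration, and the clock is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Allow up to `burst` requests at once, refilled at `rate_per_s`."""

    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now):
        # Refill based on elapsed time, capped at the bucket size.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: reject instead of queueing


# Hypothetical limits: 10 requests per second, bursts of up to 20.
bucket = TokenBucket(rate_per_s=10, burst=20)

# A wall of 100 requests at t=0: only the burst allowance gets through.
admitted_at_0 = sum(bucket.allow(now=0.0) for _ in range(100))

# One second later, another 100 arrive: only the refill gets through.
admitted_at_1 = sum(bucket.allow(now=1.0) for _ in range(100))

print(admitted_at_0, admitted_at_1)  # 20 10
```

The workers behind the limiter see at most 20 requests up front and then roughly 10 per second, regardless of how hard the client pushes. The rejected requests become the client's problem to retry, which is the point: the pacing moves out of the model pods.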

This is where the backend boundary mattered again.

The API should not be forced to eat every request at the exact speed a client sends it. If the client has no pacing, the server needs a pacing layer.

Rate limiting was not about punishing the client. It was about protecting the system from a traffic shape it could not safely absorb.

For this setup, I wanted three things:

do not route traffic to cold pods
do not let one batch job flood active workers
do not pretend autoscaling is instant

Figure: Kubernetes probes and ingress rate limits. Example probes and Nginx ingress rate-limit annotations, with readiness tied to model load state and request throttling at the ingress.
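The original manifest is not reproduced here, so the following is a sketch of what such a configuration could look like, written as Python dictionaries that mirror the manifest fields. The probe field names and the `nginx.ingress.kubernetes.io/limit-*` annotations exist in Kubernetes and ingress-nginx, but every value (port, periods, thresholds, limits, and the load-time budget) is an assumption. `startupProbe` also requires a cluster version that supports it.

```python
import json

# Assumed worst-case time for the model to load, in seconds.
MODEL_LOAD_BUDGET_S = 120

# Gives a slow-starting pod time to load before other probes take over.
startup_probe = {
    "httpGet": {"path": "/health/ready", "port": 8000},
    "periodSeconds": 5,
    "failureThreshold": 30,
}

# Controls whether the pod receives traffic.
readiness_probe = {
    "httpGet": {"path": "/health/ready", "port": 8000},
    "periodSeconds": 5,
    "failureThreshold": 2,
}

# Controls whether the container gets restarted.
liveness_probe = {
    "httpGet": {"path": "/health/live", "port": 8000},
    "periodSeconds": 10,
    "failureThreshold": 3,
}

# ingress-nginx applies these limits per client IP; values must be strings.
ingress_annotations = {
    "nginx.ingress.kubernetes.io/limit-rps": "20",
    "nginx.ingress.kubernetes.io/limit-burst-multiplier": "2",
    "nginx.ingress.kubernetes.io/limit-connections": "10",
}

# The startup window must cover the model load, or the pod is killed mid-load.
startup_window_s = startup_probe["periodSeconds"] * startup_probe["failureThreshold"]
assert startup_window_s >= MODEL_LOAD_BUDGET_S

container_probes = {
    "startupProbe": startup_probe,
    "readinessProbe": readiness_probe,
    "livenessProbe": liveness_probe,
}
print(json.dumps(container_probes, indent=2))
print(startup_window_s)  # 150
```

The check on `startup_window_s` is the part worth keeping whatever the real numbers are: a startup window shorter than the model load time turns a slow start into a restart loop.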

The model load sequence became part of the contract

The model load sequence also became part of the backend contract.

A model-serving pod was not ready when Python started. It was ready when the model was loaded, the inference function could run, and the readiness endpoint said yes.

Small distinction. Big difference under pressure.

Before:

pod exists
→ ingress may send traffic
→ model still loading
→ request hangs
→ timeout

After:

pod exists
→ model loads
→ readiness passes
→ ingress sends traffic
→ request has a real worker

The edge case was a short traffic spike.

If the burst lasts only a minute, new pods may come online after the worst part is already over. That does not mean autoscaling failed. It means autoscaling reacted more slowly than the traffic arrived.

For model-serving systems, that is normal. Cold starts are not free.

So the system needed to reduce the blast radius of sudden traffic instead of assuming scale-out would save every request instantly.
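The short-spike case can be checked with two small formulas. The first is the replica calculation from the Horizontal Pod Autoscaler documentation. The second is a simplification introduced here, not a Kubernetes feature: it estimates how much of a burst the new pods can actually cover once reaction time and cold start are subtracted. All input numbers are hypothetical.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric):
    """Documented HPA rule: ceil(current * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)


def useful_scale_out_seconds(burst_s, reaction_s, cold_start_s):
    """Seconds of the burst during which the new pods are serving traffic."""
    return max(0, burst_s - (reaction_s + cold_start_s))


# 3 pods at 95% CPU against a 50% target: the HPA asks for 6 replicas.
print(hpa_desired_replicas(3, 95, 50))  # 6

# A 60 s burst, 30 s to react, 90 s cold start: the new pods miss it entirely.
print(useful_scale_out_seconds(burst_s=60, reaction_s=30, cold_start_s=90))  # 0

# A 10 minute burst with the same delays: the new pods help for 480 s.
print(useful_scale_out_seconds(burst_s=600, reaction_s=30, cold_start_s=90))  # 480
```

In the first scenario the autoscaler makes the right decision and it still changes nothing for the requests in that burst, which is why the throttle has to sit in front of it.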

The final shape became more stable:

client
→ ingress rate limit
→ ready-only routing
→ active model pods
→ HPA scales new pods
→ new pods load model
→ readiness opens traffic

Figure: Final Kubernetes model-serving architecture. Client traffic enters the Nginx ingress with rate limiting and routes only to ready FastAPI model pods. The HPA scales workers, and readiness probes prevent cold pods from receiving inference requests.

Result

This build reminded me that Kubernetes does not understand machine learning workloads by default. It schedules containers.

It does not know that a pod still needs to load a heavy model into memory. It does not know that a batch job is reckless. It does not know that a pod can be alive and useless for inference at the same time.

I had to teach the system those boundaries.

For me, the useful pieces were simple:

startup checks
readiness checks
model-loaded state
ingress rate limits
controlled request flow
autoscaling with realistic expectations

The ingress stopped routing traffic to cold pods. Burst traffic had a throttle. New workers had time to warm up. The system still had limits, but the failure mode became cleaner.

A model-serving cluster is only useful when the routing layer respects model startup time.

Otherwise, autoscaling becomes theater.

Onto the next one. Let’s keep sharpening that edge.

First written on March 11, 2020.
