Infrastructure | Engineering Log | August 25, 2022 | 8 min read

Queue-Isolated Stable Diffusion on Local GPU Hardware

Stable Diffusion could run locally, but serving it behind an API broke when normal web concurrency reached the GPU. The fix was separating request intake from GPU execution with FastAPI, Redis, Celery, object storage, and single-worker queue isolation.

Tags: Stable Diffusion, GPU Workloads, Queueing, FastAPI, Redis, Celery, Object Storage, Backend Architecture

The Stable Diffusion weights were released publicly this week, and the internet is moving fast.

Everywhere I look, people are running notebooks, sharing generated images, and testing prompts on local machines. It’s exciting. It also makes the deployment problem look easier than it really is.

Running a model once is one thing. Serving it behind an API is a different problem.

My first deployment crashed when two users clicked Generate at almost the same time: the process ran out of GPU memory and took the container down with it.

The model itself could run. That wasn’t the issue. The serving path let normal web concurrency reach a GPU workload that couldn’t safely handle it.

The first version looked like a standard API:

client submits prompt
→ FastAPI receives request
→ model generates image
→ image returns in response

That shape works for normal backend work. It breaks down when the request needs a large model already sitting in GPU memory.

The checkpoint was several gigabytes. Loading it from disk on every request was too slow, so the model had to stay warm in process memory. Once loaded, a single 512x512 generation could consume a large chunk of VRAM. One request was fine. Two overlapping requests pushed the process into CUDA out-of-memory territory.

The web server saw two normal HTTP requests.

The GPU saw two heavy jobs competing for the same constrained hardware.

The HTTP layer stopped touching the model

I stopped letting the HTTP layer touch the model directly.

New flow:

client submits prompt
→ API validates prompt
→ job ID is created
→ prompt is pushed to Redis
→ API returns immediately
→ Celery worker pulls one job
→ Stable Diffusion runs on the GPU
→ image uploads to object storage
→ job status updates
→ frontend fetches or receives the result

Figure: Queue-isolated Stable Diffusion architecture, showing FastAPI, the Redis queue, the single GPU worker, object storage, the job table, and frontend status retrieval.

The main change was separating request intake from GPU execution.

FastAPI became the intake layer. It validated prompts, created job records, pushed work into Redis, and returned a job ID. It no longer imported the model or ran diffusion inside the request handler.
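
The intake step can be sketched without any framework. This is a minimal illustration, not the original handler: the in-memory `jobs` dict and `queue.Queue` stand in for the job table and Redis, and `submit_prompt`, `MAX_PROMPT_CHARS`, and the field names are hypothetical. In the real service this logic would sit inside the FastAPI route and enqueue through Celery.

```python
import queue
import uuid
from datetime import datetime, timezone

MAX_PROMPT_CHARS = 500  # hypothetical limit, chosen only for illustration

jobs = {}                  # stand-in for the job table
job_queue = queue.Queue()  # stand-in for the Redis-backed queue


def submit_prompt(prompt: str) -> dict:
    """Validate, record, enqueue, and return without touching the GPU."""
    text = prompt.strip()
    if not text:
        raise ValueError("prompt must not be empty")
    if len(text) > MAX_PROMPT_CHARS:
        raise ValueError("prompt is too long")

    job_id = uuid.uuid4().hex
    jobs[job_id] = {
        "prompt": text,
        "status": "queued",
        "queued_at": datetime.now(timezone.utc),
    }
    job_queue.put(job_id)  # a worker picks this up later
    return {"job_id": job_id, "status": "queued"}
```

Nothing in this function imports or calls the model, so a burst of requests only costs a few dictionary writes and queue pushes.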

The GPU instance ran a single Celery worker with sole access to the model. That worker loaded the weights once at startup, kept them warm, pulled one job at a time, generated the image, wrote the result to storage, then moved to the next job.

Concurrency moved to the queue.
The GPU stayed single-file.

That tradeoff was intentional. Ten users could submit prompts at once, but only one generation touched the GPU at a time. The rest waited in Redis instead of crashing the container.
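
To make the single-file behaviour concrete, here is a small simulation under stated assumptions: `queue.Queue` stands in for Redis, one thread stands in for the Celery worker, and `fake_generate` sleeps instead of running diffusion. All names are hypothetical. It counts how many generations are ever in flight at once.

```python
import queue
import threading
import time

job_queue = queue.Queue()  # stand-in for Redis
results = {}
active = 0                 # generations currently "on the GPU"
peak_active = 0
lock = threading.Lock()


def fake_generate(prompt: str) -> str:
    """Placeholder for the diffusion call; sleeps instead of using a GPU."""
    time.sleep(0.01)
    return f"image-for:{prompt}"


def gpu_worker() -> None:
    """Pull one job at a time until a None sentinel arrives."""
    global active, peak_active
    while True:
        job = job_queue.get()
        if job is None:
            break
        job_id, prompt = job
        with lock:
            active += 1
            peak_active = max(peak_active, active)
        results[job_id] = fake_generate(prompt)
        with lock:
            active -= 1


def run_burst(n_jobs: int) -> int:
    """Submit n_jobs from separate threads and drain them with one worker."""
    worker = threading.Thread(target=gpu_worker)
    worker.start()
    submitters = [
        threading.Thread(target=job_queue.put, args=((i, f"prompt {i}"),))
        for i in range(n_jobs)
    ]
    for t in submitters:
        t.start()
    for t in submitters:
        t.join()
    job_queue.put(None)  # sentinel: stop once the queued jobs are done
    worker.join()
    return peak_active
```

Calling `run_burst(10)` submits ten jobs concurrently, yet the peak number of in-flight generations stays at one. With Celery, the rough equivalent is starting the worker with `--concurrency=1`.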

Figure: FastAPI generation intake endpoint. It validates a prompt, creates a generation job, pushes it to Redis and Celery, and returns a job ID without touching the GPU.

The worker startup became part of the contract

I also changed the model loading path. The worker loaded the weights in FP16 instead of FP32. That lowered the VRAM pressure enough to keep the model warm on a standard cloud GPU without wasting memory on precision the output didn’t visibly need.
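
The saving on weights is simple arithmetic: FP32 stores each parameter in 4 bytes and FP16 in 2. The sketch below uses an assumed round parameter count, not the real checkpoint size, and it covers weights only; activation memory during generation is extra and is not estimated here.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_bytes(n_params: int, dtype: str) -> int:
    """Memory needed to hold the weights alone, in bytes."""
    return n_params * BYTES_PER_PARAM[dtype]


def to_gib(n_bytes: int) -> float:
    return n_bytes / 2**30


N_PARAMS = 1_000_000_000  # assumed round number, for illustration only

fp32 = weight_bytes(N_PARAMS, "fp32")
fp16 = weight_bytes(N_PARAMS, "fp16")
print(f"fp32 weights: {to_gib(fp32):.2f} GiB")  # 3.73 GiB
print(f"fp16 weights: {to_gib(fp16):.2f} GiB")  # 1.86 GiB
```

Whatever the true parameter count, the ratio is the same: half-precision weights take half the memory.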

The worker startup became part of the deployment contract:

load model once
confirm GPU availability
use FP16
reject startup if weights are missing
expose worker health
process one job at a time
write failure reason back to the job record
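
The fail-fast part of that contract might look like the sketch below. It is an assumption-laden illustration: `check_startup` and `StartupError` are hypothetical names, and the GPU check is passed in as a callable so the example runs without a GPU. In a PyTorch-based worker that callable could be `torch.cuda.is_available`.

```python
from pathlib import Path
from typing import Callable


class StartupError(RuntimeError):
    """Raised when the worker must refuse to start."""


def check_startup(weights_path: Path, gpu_available: Callable[[], bool]) -> dict:
    """Fail fast before accepting jobs; return the settings the worker will use."""
    if not weights_path.exists():
        raise StartupError(f"weights not found: {weights_path}")
    if not gpu_available():
        raise StartupError("no GPU visible to the worker")
    # Settings taken from the contract above: half precision, one job at a time.
    return {"weights": str(weights_path), "dtype": "fp16", "concurrency": 1}
```

Raising before the worker subscribes to the queue means a misconfigured instance never pulls a job it cannot finish.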

Figure: Single-worker GPU execution path. A Celery GPU worker loads Stable Diffusion once on startup, processes one queued job at a time, uploads the image to object storage, and updates job status.

The job record replaced the HTTP request as state

The job record handled the state the HTTP request could no longer hold:

job ID
prompt reference
status
queued time
started time
completed time
failure reason
storage URL
model version
image parameters

This made the system easier to operate. If the worker crashed, the failure could be attached to the job. If the browser disconnected, the job could still finish. If the queue backed up, the API could still accept work while showing the user that generation was pending.

The hard part was accepting that the GPU should not behave like a web server.

A web server is comfortable with concurrency. A single GPU running a large diffusion model needs stricter control. Letting multiple request handlers hit the model directly turns normal user behavior into infrastructure failure.

Open weights still need backend boundaries

The open weights are a big shift. The economics change when a model can run on your own hardware. You’re no longer only renting someone else’s API endpoint for every call.

The backend obligations stay real:

keep the model warm
bound GPU concurrency
queue excess work
track job state
store outputs outside the worker
expose progress clearly
make failure recoverable
protect the API from the compute layer

Result

After the change, the system became less exciting in the right way. Multiple users could submit prompts without killing the server. The queue absorbed bursts. The GPU worked through jobs one at a time. Slow requests became waiting jobs instead of crashed containers.

That’s the real difference between a local notebook and a production-facing image API.

The notebook proves the model can run.

The backend proves it can survive users.

Onto the next one. Let’s keep sharpening that edge.

First written on August 25, 2022.
