Queue-Isolated Stable Diffusion on Local GPU Hardware
Stable Diffusion could run locally, but serving it behind an API broke when normal web concurrency reached the GPU. The fix was separating request intake from GPU execution with FastAPI, Redis, Celery, object storage, and single-worker queue isolation.
The Stable Diffusion weights were released this week, and the internet is moving fast.
Everywhere I look, people are running notebooks, sharing generated images, and testing prompts on local machines. It’s exciting. It also makes the deployment problem look easier than it really is.
Running a model once is one thing. Serving it behind an API is a different problem.
When two users clicked Generate at almost the same time, the process ran out of GPU memory and the container died.
The model itself could run. That wasn’t the issue. The serving path let normal web concurrency reach a GPU workload that couldn’t safely handle it.
The first version looked like a standard API:
client submits prompt
→ FastAPI receives request
→ model generates image
→ image returns in response
That shape works for normal backend work. It breaks down when the request needs a large model already sitting in GPU memory.
The checkpoint was several gigabytes. Loading it from disk on every request was too slow, so the model had to stay warm in process memory. Once loaded, a single 512x512 generation could consume a large chunk of VRAM. One request was fine. Two overlapping requests pushed the process into CUDA out-of-memory territory.
The web server saw two normal HTTP requests.
The GPU saw two heavy jobs competing for the same constrained hardware.
The HTTP layer stopped touching the model
I stopped letting the HTTP layer touch the model directly.
New flow:
client submits prompt
→ API validates prompt
→ job ID is created
→ prompt is pushed to Redis
→ API returns immediately
→ Celery worker pulls one job
→ Stable Diffusion runs on the GPU
→ image uploads to object storage
→ job status updates
→ frontend fetches or receives the result
FastAPI handled intake. Redis absorbed bursts. A single GPU worker owned model execution.
The main change was separating request intake from GPU execution.
FastAPI became the intake layer. It validated prompts, created job records, pushed work into Redis, and returned a job ID. It no longer imported the model or ran diffusion inside the request handler.
The GPU instance ran a single Celery worker with sole access to the model. That worker loaded the weights once at startup, kept them warm, pulled one job at a time, generated the image, wrote the result to storage, then moved to the next job.
Concurrency moved to the queue.
The GPU stayed single-file.
That tradeoff was intentional. Ten users could submit prompts at once, but only one generation touched the GPU at a time. The rest waited in Redis instead of crashing the container.
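To make that tradeoff concrete, here is a rough sketch using only the Python standard library. A `queue.Queue` stands in for Redis, a thread stands in for the single Celery worker, and `fake_generate` is a hypothetical placeholder for the diffusion call. None of these names come from the real system; the point is only that submissions can overlap while generations cannot.

```python
import queue
import threading
import time


def run_demo(n_users=10):
    """Return (peak concurrent generations, jobs finished)."""
    jobs = queue.Queue()            # stand-in for the Redis-backed queue
    state = {"in_flight": 0, "peak": 0}
    finished = []
    lock = threading.Lock()

    def fake_generate(prompt):
        # Hypothetical placeholder for the diffusion call; sleeps to mimic GPU time.
        with lock:
            state["in_flight"] += 1
            state["peak"] = max(state["peak"], state["in_flight"])
        time.sleep(0.01)
        with lock:
            state["in_flight"] -= 1
        return f"image-for:{prompt}"

    def gpu_worker():
        # The only consumer: pull one job, finish it, then pull the next.
        while True:
            prompt = jobs.get()
            if prompt is None:      # sentinel: no more work
                return
            finished.append(fake_generate(prompt))

    worker = threading.Thread(target=gpu_worker)
    worker.start()

    # The "users" submit at once; submitting only enqueues, it never generates.
    users = [
        threading.Thread(target=jobs.put, args=(f"prompt-{i}",))
        for i in range(n_users)
    ]
    for user in users:
        user.start()
    for user in users:
        user.join()

    jobs.put(None)
    worker.join()
    return state["peak"], len(finished)
```

Calling `run_demo(10)` accepts ten overlapping submissions, yet the peak number of simultaneous generations stays at one because only one consumer ever reads from the queue.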
The request handler created a job and returned immediately instead of running model inference inside the HTTP path.
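A minimal sketch of that intake step, written against the standard library so it runs anywhere. In the real service this logic would sit inside a FastAPI route and push to Redis; here `Intake`, `submit`, and `MAX_PROMPT_CHARS` are hypothetical names, a dict stands in for the job store, and `queue.Queue` stands in for Redis.

```python
import queue
import uuid
from dataclasses import dataclass, field

MAX_PROMPT_CHARS = 500  # hypothetical limit, chosen for illustration


@dataclass
class Intake:
    jobs: dict = field(default_factory=dict)                   # stand-in job store
    pending: queue.Queue = field(default_factory=queue.Queue)  # stand-in for Redis

    def submit(self, prompt: str) -> dict:
        """Validate, record, enqueue, and return without touching the model."""
        prompt = prompt.strip()
        if not prompt or len(prompt) > MAX_PROMPT_CHARS:
            raise ValueError(f"prompt must be 1..{MAX_PROMPT_CHARS} characters")

        job_id = uuid.uuid4().hex
        self.jobs[job_id] = {"status": "queued", "prompt": prompt}
        self.pending.put(job_id)

        # The caller gets a handle to poll, not an image.
        return {"job_id": job_id, "status": "queued"}
```

The handler's work is bounded by validation and one enqueue, so its latency no longer depends on how long a generation takes.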
The worker startup became part of the contract
I also changed the model loading path. The worker loaded the weights in FP16 instead of FP32. That lowered the VRAM pressure enough to keep the model warm on a standard cloud GPU without wasting memory on precision the output didn’t visibly need.
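The saving is simple arithmetic: FP32 stores each parameter in 4 bytes, FP16 in 2. The sketch below assumes a round one billion parameters purely for illustration; it is not a measured figure for this checkpoint, and it counts weights only, not the activations produced during a generation.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}

N_PARAMS = 1_000_000_000  # assumed round figure for illustration, not measured


def weight_gib(n_params: int, dtype: str) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for dtype in ("fp32", "fp16"):
        print(f"{dtype}: {weight_gib(N_PARAMS, dtype):.2f} GiB")
```

Under that assumption the weights take about 3.7 GiB in FP32 and about 1.9 GiB in FP16. Halving the resident footprint is what leaves headroom for the generation itself.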
The worker startup became part of the deployment contract:
- load model once
- confirm GPU availability
- use FP16
- reject startup if weights are missing
- expose worker health
- process one job at a time
- write failure reason back to the job record
The worker owned the warm model process, processed one generation at a time, and wrote success or failure back to the job record.
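The contract above can be sketched as a small class. This is an illustration under assumptions, not the original worker: `GpuWorker`, `StartupError`, and the injected `gpu_available` and `load_model` callables are hypothetical names, passed in so the sketch runs without a GPU or the real model library.

```python
import os


class StartupError(RuntimeError):
    """Raised when the worker must refuse to start."""


class GpuWorker:
    def __init__(self, weights_path, gpu_available, load_model):
        # Reject startup early instead of failing on the first job.
        if not os.path.exists(weights_path):
            raise StartupError(f"weights not found: {weights_path}")
        if not gpu_available():
            raise StartupError("no GPU visible to the worker")
        # Load once, in FP16, and keep the model warm for every later job.
        self.model = load_model(weights_path, "fp16")
        self.healthy = True

    def health(self):
        """Minimal health report for an external probe."""
        return {"healthy": self.healthy, "model_loaded": self.model is not None}

    def run_one(self, job):
        """Process exactly one job and record the outcome on the job itself."""
        job["status"] = "running"
        try:
            job["result"] = self.model(job["prompt"])
            job["status"] = "succeeded"
        except Exception as exc:
            # The failure reason travels with the job, not with a lost HTTP request.
            job["status"] = "failed"
            job["failure_reason"] = f"{type(exc).__name__}: {exc}"
        return job
```

Injecting the GPU check and the loader keeps the startup rules testable on a laptop; in a real deployment those two callables would wrap the actual framework calls.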
The job record replaced the HTTP request as state
The job record handled the state the HTTP request could no longer hold:
- job ID
- prompt reference
- status
- queued time
- started time
- completed time
- failure reason
- storage URL
- model version
- image parameters
This made the system easier to operate. If the worker crashed, the failure could be attached to the job. If the browser disconnected, the job could still finish. If the queue backed up, the API could still accept work while showing the user that generation was pending.
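One way to picture that record is a small dataclass with the fields listed above. The field names, the status strings, and the `start`/`succeed`/`fail` helpers are assumptions made for this sketch; the original schema is not shown in this post.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class JobRecord:
    job_id: str
    prompt_ref: str            # pointer to the stored prompt, not the text itself
    model_version: str
    image_params: dict
    status: str = "queued"     # hypothetical states: queued, running, succeeded, failed
    queued_at: datetime = field(default_factory=_now)
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None
    failure_reason: Optional[str] = None
    storage_url: Optional[str] = None

    def start(self) -> None:
        self.status = "running"
        self.started_at = _now()

    def succeed(self, storage_url: str) -> None:
        self.status = "succeeded"
        self.storage_url = storage_url
        self.completed_at = _now()

    def fail(self, reason: str) -> None:
        self.status = "failed"
        self.failure_reason = reason
        self.completed_at = _now()
```

Because every transition is written to the record, a status lookup can answer "pending", "done, here is the URL", or "failed, here is why" without the original connection still being open.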
The hard part was accepting that the GPU should not behave like a web server.
A web server is comfortable with concurrency. A single GPU running a large diffusion model needs stricter control. Letting multiple request handlers hit the model directly turns normal user behavior into infrastructure failure.
Open weights still need backend boundaries
The open weights are a big shift. The economics change when a model can run on your own hardware. You no longer have to rent someone else’s API endpoint for every call.
The backend obligations stay real:
- keep the model warm
- bound GPU concurrency
- queue excess work
- track job state
- store outputs outside the worker
- expose progress clearly
- make failure recoverable
- protect the API from the compute layer
Result
After the change, the system became less exciting in the right way. Multiple users could submit prompts without killing the server. The queue absorbed bursts. The GPU worked through jobs one at a time. Slow requests became waiting jobs instead of crashed containers.
That’s the real difference between a local notebook and a production-facing image API.
The notebook proves the model can run.
The backend proves it can survive users.
Onto the next one. Let’s keep sharpening that edge.
First written on August 25, 2022.