Infrastructure | Engineering Log | September 28, 2020 | 8 min read

Dynamic Batching for Heavy PyTorch Workloads

A BERT classification API had a GPU but still timed out because it fed the model one request at a time. The fix was moving from a custom Python batching queue to TorchServe dynamic batching with per-item validation, request mapping, and serving metrics.

Tags: PyTorch, TorchServe, BERT, Dynamic Batching, GPU Inference, Model Serving, Backend Architecture, Reliability

The API timed out under a traffic spike that should have been manageable.

I was running a BERT text classification model on an AWS g4dn.xlarge. The instance had a GPU, but the request path was feeding the model one text payload at a time. Every HTTP request triggered its own tokenization step, tensor move, forward pass, and response.

The flow looked like this:

HTTP request
→ tokenize text
→ move tensor to GPU
→ run BERT
→ return class label

Figure: Sequential BERT inference path. Each HTTP request is tokenized and sent to the GPU one at a time, causing low GPU utilization and request backlog.

The serving code was simple, but the workload shape was wrong. BERT is not cheap to run, and sending one short text input per forward pass left too much GPU capacity unused while the API workers queued requests.

The first fix was a custom batching queue

The first fix I tried was a custom Python batching queue.

Incoming requests went into memory. A background thread waited around 50 ms, collected pending items, tokenized them together, ran one batched inference pass, then split the results back to each caller.

incoming requests
→ in-memory queue
→ short wait window
→ batch tokenization
→ one BERT forward pass
→ split responses

Figure: Custom dynamic batching queue. Incoming API requests wait briefly in a Python queue, get stacked into a batch, run through BERT together, then return individual responses.
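A minimal sketch of a queue like this, using only the standard library. It is not the original code: `classify_batch` is a hypothetical stand-in for batch tokenization plus the BERT forward pass, and `MAX_BATCH` is an assumed cap. Only the roughly 50 ms window comes from the description above.

```python
import queue
import threading
import time
from concurrent.futures import Future

BATCH_WINDOW_S = 0.05  # the ~50 ms wait window described above
MAX_BATCH = 16         # assumed cap, not taken from the original setup


def classify_batch(texts):
    """Hypothetical stand-in for batch tokenization plus one BERT forward pass."""
    return ["long" if len(t) > 20 else "short" for t in texts]


class BatchingQueue:
    def __init__(self, predict_fn):
        self._predict_fn = predict_fn
        self._pending = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, text):
        """Called once per request; returns a Future for that request's result."""
        future = Future()
        self._pending.put((text, future))
        return future

    def _loop(self):
        while True:
            batch = [self._pending.get()]  # block until the first item arrives
            deadline = time.monotonic() + BATCH_WINDOW_S
            while len(batch) < MAX_BATCH:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self._pending.get(timeout=remaining))
                except queue.Empty:
                    break
            texts = [text for text, _ in batch]
            try:
                labels = self._predict_fn(texts)
                # Route each output back to the caller that submitted it.
                for (_, future), label in zip(batch, labels):
                    future.set_result(label)
            except Exception as exc:
                # A failure in the batched call fails every caller in the batch.
                for _, future in batch:
                    future.set_exception(exc)
```

A request handler would call `BatchingQueue(classify_batch).submit(text)` and then wait on `future.result(timeout=...)`. Even at this size, the loop already owns futures, timing, and error routing.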

Throughput improved, but the queue became part of the serving infrastructure. I had to manage request IDs, futures, timeouts, malformed inputs, cancelled requests, partial failures, and response routing inside the application layer.

The worst failure was batch poisoning.

One malformed string could fail tokenization for the batch. That meant several unrelated users could receive a 500 because one input was bad. I added validation before enqueueing and kept request IDs attached to each item, but the custom loop kept growing.

At that point, the serving path looked like this:

request arrives
→ validate text
→ assign request ID
→ enqueue future
→ batching thread waits
→ tokenize batch
→ run inference
→ split outputs
→ resolve responses
→ clean up timeouts and failed items

Figure: Custom batching complexity. Request IDs, validation, futures, batching thread, timeout cleanup, model inference, and per-request response routing all sit inside the application layer.

TorchServe became the serving boundary

I moved the model into TorchServe.

The new path was smaller:

client request
→ API layer
→ TorchServe endpoint
→ custom BERT handler
→ batched inference
→ response

Figure: TorchServe BERT deployment. Client requests hit an API layer, route into TorchServe, pass through a custom BERT handler, get batched for inference, and return individual predictions.

TorchServe let me configure batching with batch_size and max_batch_delay. The handler still had to do the application-specific work: validate text, cap sequence length, tokenize inputs, preserve request mapping, and return per-item errors.

Figure: Example TorchServe model configuration using batch_size, max_batch_delay, worker count, and a custom BERT handler for tokenization and response mapping.
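One way to set these values is at registration time through the TorchServe management API, which accepts `batch_size`, `max_batch_delay` (in milliseconds), and `initial_workers` as query parameters. The sketch below is an illustration rather than the original configuration. The archive name `bert_classifier.mar`, the batch size of 8, and the single worker are assumptions, and the custom handler is assumed to be packaged inside the archive already.

```python
import urllib.parse
import urllib.request

MANAGEMENT_URL = "http://localhost:8081"  # TorchServe's default management port


def build_register_url(mar_name, batch_size, max_batch_delay_ms, workers):
    """Build the POST /models URL that registers a model with batching enabled."""
    params = urllib.parse.urlencode({
        "url": mar_name,                        # archive in the model store
        "batch_size": batch_size,               # max requests per forward pass
        "max_batch_delay": max_batch_delay_ms,  # max wait for a batch to fill
        "initial_workers": workers,
        "synchronous": "true",                  # wait until workers are up
    })
    return f"{MANAGEMENT_URL}/models?{params}"


def register_model(mar_name="bert_classifier.mar", batch_size=8,
                   max_batch_delay_ms=80, workers=1):
    """Send the registration request; needs a running TorchServe instance."""
    url = build_register_url(mar_name, batch_size, max_batch_delay_ms, workers)
    request = urllib.request.Request(url, method="POST")
    with urllib.request.urlopen(request, timeout=120) as response:
        return response.read().decode("utf-8")
```

TorchServe dispatches a batch when either `batch_size` requests have arrived or `max_batch_delay` has elapsed, whichever happens first.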

The key rule was simple:

Bad input fails before batch construction.

Empty text, oversized payloads, invalid encodings, and missing fields were rejected per request. Valid requests could still be batched together without one bad input taking down the group.

Figure: Example custom TorchServe BERT handler that validates each item, tokenizes valid inputs, preserves request IDs, runs batched inference, and returns per-item errors for invalid records.
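A rough sketch of a handler that follows this rule, written as a class-style TorchServe handler with `initialize` and `handle`. It is not the original handler. The `text` field name, `MAX_CHARS`, and `MAX_TOKENS` are assumptions, the model directory is assumed to contain Hugging Face model and tokenizer files, and results are mapped back to requests by their position in the batch rather than by an explicit request ID.

```python
import json

MAX_CHARS = 10_000  # assumed payload cap
MAX_TOKENS = 128    # assumed sequence length cap


def validate_row(row):
    """Return (text, None) for a usable request or (None, reason) for a bad one."""
    body = row.get("body") or row.get("data")
    if isinstance(body, (bytes, bytearray)):
        try:
            body = body.decode("utf-8")
        except UnicodeDecodeError:
            return None, "invalid encoding"
    if isinstance(body, str):
        try:
            body = json.loads(body)
        except json.JSONDecodeError:
            return None, "invalid JSON"
    if not isinstance(body, dict) or "text" not in body:
        return None, "missing field: text"
    text = body["text"]
    if not isinstance(text, str) or not text.strip():
        return None, "empty text"
    if len(text) > MAX_CHARS:
        return None, "payload too large"
    return text, None


class BertBatchHandler:
    """Class-style handler: initialize() runs once, handle() runs per batch."""

    def initialize(self, context):
        # Heavy imports live here so validation can be tested without a GPU.
        import torch
        from transformers import AutoModelForSequenceClassification, AutoTokenizer

        model_dir = context.system_properties.get("model_dir")
        self.torch = torch
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_dir)
        self.model.to(self.device).eval()

    def predict(self, texts):
        """Tokenize the valid texts together and run one forward pass."""
        encoded = self.tokenizer(texts, padding=True, truncation=True,
                                 max_length=MAX_TOKENS, return_tensors="pt")
        encoded = {name: tensor.to(self.device) for name, tensor in encoded.items()}
        with self.torch.no_grad():
            logits = self.model(**encoded)[0]
        return logits.argmax(dim=-1).tolist()

    def handle(self, data, context):
        results = [None] * len(data)  # one response slot per incoming request
        valid_idx, valid_texts = [], []
        for idx, row in enumerate(data):
            text, error = validate_row(row)
            if error:
                results[idx] = {"error": error}  # rejected before batching
            else:
                valid_idx.append(idx)
                valid_texts.append(text)
        if valid_texts:
            for idx, label in zip(valid_idx, self.predict(valid_texts)):
                results[idx] = {"label": label}
        return results
```

The returned list has the same length and order as the incoming batch, which is what TorchServe uses to route each entry back to its caller. A bad row only ever fills its own slot.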

The tuning was practical

After migration, the endpoint moved from roughly 15 requests per second to around 120 requests per second, about an 8x increase, under the same class of traffic. The batching window settled around 80 ms for this workload, which was acceptable for the classification flow.

Each knob came with a trade-off:

larger batch_size
→ better GPU use
→ higher memory pressure

larger max_batch_delay
→ more complete batches
→ slower individual response

smaller max_batch_delay
→ faster response
→ weaker batching under uneven traffic

Figure: Before and after serving metrics. Sequential inference shows low throughput, rising queue time, and underused GPU. TorchServe batching shows higher throughput, stable latency, and better GPU utilization.
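A back-of-envelope way to reason about these trade-offs is to estimate how many requests arrive during one delay window. The helper below is a hypothetical illustration that assumes a steady arrival rate, which real burst traffic will not match, and the batch size of 16 is an assumed value.

```python
def expected_batch_fill(requests_per_second, max_batch_delay_ms, batch_size):
    """Average requests arriving in one delay window, capped at batch_size."""
    arrivals = requests_per_second * max_batch_delay_ms / 1000
    return min(arrivals, batch_size)


# 120 requests/s with an 80 ms window: about 9.6 requests per batch.
busy = expected_batch_fill(120, 80, batch_size=16)

# 15 requests/s with the same window: about 1.2, so batches are nearly empty.
quiet = expected_batch_fill(15, 80, batch_size=16)
```

At low arrival rates the window mostly adds latency without filling batches, which is the "weaker batching under uneven traffic" case in the list above.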

The final setup was:

BERT packaged for TorchServe
custom handler for preprocessing
input validation before tokenization
sequence length caps
dynamic batching
per-item error handling
serving metrics watched under burst traffic

Figure: Final BERT serving setup. TorchServe hosts the BERT model with a custom handler, validates inputs, batches requests, runs GPU inference, returns per-item responses, and exposes serving metrics.
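For watching serving metrics during a burst test, one option is to poll the TorchServe metrics endpoint and parse its Prometheus-style text output. This sketch assumes the endpoint is enabled on the default metrics port 8082 and that sample lines carry no trailing timestamp. It makes no assumption about which metric names a given TorchServe version emits.

```python
import urllib.request

METRICS_URL = "http://localhost:8082/metrics"  # TorchServe's default metrics port


def parse_prometheus_text(payload):
    """Map each sample line ('name{labels} value') to a float, skipping comments."""
    samples = {}
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        try:
            samples[series] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return samples


def fetch_metrics(url=METRICS_URL):
    """Fetch and parse the current metrics; needs a running TorchServe instance."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return parse_prometheus_text(response.read().decode("utf-8"))
```

Calling `fetch_metrics()` at a fixed interval during a load test gives a simple time series to compare against the throughput and latency seen by the client.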

Result

The useful lesson from this build:

The expensive part was already paid for. The GPU was there. The waste came from feeding it work in the wrong shape.

The custom queue proved the batching idea, then became the thing to remove.

TorchServe gave the model a better serving boundary. The application layer went back to handling product concerns. The inference server handled batching, workers, model loading, and serving metrics.

For this workload, the win was straightforward:

batch size 1
→ low throughput and request backlog

dynamic batching
→ higher throughput, cleaner failure handling, better GPU use

Onto the next one. Let’s keep sharpening that edge.

First written on September 28, 2020.
