Back to Blog
InfrastructureEngineering Log September 28, 2020 8 min read

Dynamic Batching for Heavy PyTorch Workloads

A BERT classification API had a GPU but still timed out because it fed the model one request at a time. The fix was moving from a custom Python batching queue to TorchServe dynamic batching with per-item validation, request mapping, and serving metrics.

PyTorchTorchServeBERTDynamic BatchingGPU InferenceModel ServingBackend ArchitectureReliability

The API timed out under a traffic spike that should have been manageable.

I was running a BERT text classification model on an AWS g4dn.xlarge. The instance had a GPU, but the request path was feeding the model one text payload at a time. Every HTTP request triggered its own tokenization step, tensor move, forward pass, and response.

The flow looked like this:

HTTP request
→ tokenize text
→ move tensor to GPU
→ run BERT
→ return class label
Sequential BERT inference path. Each HTTP request is tokenized and sent to the GPU one at a time, causing low GPU utilization and request backlog.
Sequential BERT inference path

The serving code was simple, but each request paid its own tokenization, tensor transfer, and forward-pass cost.

The serving code was simple, but the workload shape was wrong. BERT is not cheap to run, and sending one short text input per forward pass left too much GPU capacity unused while the API workers queued requests.

The first fix was a custom batching queue

The first fix I tried was a custom Python batching queue.

Incoming requests went into memory. A background thread waited around 50 ms, collected pending items, tokenized them together, ran one batched inference pass, then split the results back to each caller.

incoming requests
→ in-memory queue
→ short wait window
→ batch tokenization
→ one BERT forward pass
→ split responses
Custom dynamic batching queue. Incoming API requests wait briefly in a Python queue, get stacked into a batch, run through BERT together, then return individual responses.
Custom dynamic batching queue

A short batching window improved throughput by feeding the GPU work in a better shape.

Throughput improved, but the queue became part of the serving infrastructure. I had to manage request IDs, futures, timeouts, malformed inputs, cancelled requests, partial failures, and response routing inside the application layer.

The worst failure was batch poisoning.

One malformed string could fail tokenization for the batch. That meant several unrelated users could receive a 500 because one input was bad. I added validation before enqueueing and kept request IDs attached to each item, but the custom loop kept growing.

At that point, the serving path looked like this:

request arrives
→ validate text
→ assign request ID
→ enqueue future
→ batching thread waits
→ tokenize batch
→ run inference
→ split outputs
→ resolve responses
→ clean up timeouts and failed items
Custom batching complexity. Request IDs, validation, futures, batching thread, timeout cleanup, model inference, and per-request response routing all sit inside the application layer.
Custom batching complexity

The custom queue proved the batching idea, then became too much serving infrastructure inside the application layer.

TorchServe became the serving boundary

I moved the model into TorchServe.

The new path was smaller:

client request
→ API layer
→ TorchServe endpoint
→ custom BERT handler
→ batched inference
→ response
TorchServe BERT deployment. Client requests hit an API layer, route into TorchServe, pass through a custom BERT handler, get batched for inference, and return individual predictions.
TorchServe BERT deployment

The application layer returned to product concerns while the inference server owned batching, workers, model loading, and serving metrics.

TorchServe let me configure batching with batch_size and max_batch_delay. The handler still had to do the application-specific work: validate text, cap sequence length, tokenize inputs, preserve request mapping, and return per-item errors.

Example TorchServe model configuration using batch_size, max_batch_delay, worker count, and a custom BERT handler for tokenization and response mapping.
TorchServe model configuration

The serving config controlled batching delay, batch size, workers, and handler wiring.

The key rule was simple:

Bad input fails before batch construction.

Empty text, oversized payloads, invalid encodings, and missing fields were rejected per request. Valid requests could still be batched together without one bad input taking down the group.

Example custom TorchServe BERT handler that validates each item, tokenizes valid inputs, preserves request IDs, runs batched inference, and returns per-item errors for invalid records.
Custom BERT handler

The handler validated each item before batch construction, preserved request IDs, and returned per-item errors instead of poisoning the batch.

The tuning was practical

After migration, the endpoint moved from roughly 15 requests per second to around 120 requests per second under the same class of traffic. The batching window settled around 80 ms for this workload, which was acceptable for the classification flow.

The tuning was mostly practical:

larger batch_size
→ better GPU use
→ higher memory pressure

larger max_batch_delay
→ more complete batches
→ slower individual response

smaller max_batch_delay
→ faster response
→ weaker batching under uneven traffic
Before and after serving metrics. Sequential inference shows low throughput, rising queue time, and underused GPU. TorchServe batching shows higher throughput, stable latency, and better GPU utilization.
Before and after serving metrics

Dynamic batching increased throughput and GPU utilization while keeping latency acceptable for classification.

The final setup was:

  • BERT packaged for TorchServe
  • custom handler for preprocessing
  • input validation before tokenization
  • sequence length caps
  • dynamic batching
  • per-item error handling
  • serving metrics watched under burst traffic
Final BERT serving setup. TorchServe hosts the BERT model with a custom handler, validates inputs, batches requests, runs GPU inference, returns per-item responses, and exposes serving metrics.
Final BERT serving setup

The final serving boundary handled batching, validation, request mapping, GPU inference, and metrics without custom queue logic inside the API.

Result

The useful lesson from this build:

The expensive part was already paid for. The GPU was there. The waste came from feeding it work in the wrong shape.

The custom queue proved the batching idea, then became the thing to remove.

TorchServe gave the model a better serving boundary. The application layer went back to handling product concerns. The inference server handled batching, workers, model loading, and serving metrics.

For this workload, the win was straightforward:

batch size 1
→ low throughput and request backlog

dynamic batching
→ higher throughput, cleaner failure handling, better GPU use

Onto the next one. Let’s keep sharpening that edge.

First written on September 28, 2020.

Want to implement this architecture in your business?

Discuss Your Project