Streaming Systems | Engineering Log | August 12, 2024 | 8 min read

Real-Time Multimodal Pipelines for Audio and Video Streams

A live accessibility prototype started lagging after two minutes because the backend treated video like repeated static uploads. The fix was not a bigger model. It was buffering, keyframe selection, stale-work dropping, VAD, and bounded inference.

Tags: Backend Architecture, Multimodal AI, Latency, Queueing, Accessibility, VAD

I’ve been testing live multimodal workflows lately, mostly around accessibility.

Static image upload is one thing. A user sends a screenshot, the backend validates it, the model reads it, and the app returns a description.

Live video is different. It doesn’t wait for the backend to catch up.

This accessibility prototype had a simple goal: describe what’s happening in a video feed and transcribe useful audio for low-vision users. Not a full self-driving perception system. More like a practical assistant that can say what changed, what’s nearby, and what the user may need to pay attention to.

The system lagged by eight seconds after only two minutes of live video.

That was enough to make the whole thing useless.

The first version treated video like a fast sequence of image uploads.

camera frame
→ frontend capture
→ API request
→ multimodal model call
→ description
→ frontend update

At low volume, it looked fine.

Once the stream kept running, the math got ugly.

The frontend captured a frame every second. Each reasoning step took longer than that. Some calls finished in 1.5 seconds. Some took closer to 2 seconds. A queue started forming immediately.

Within two minutes, the backend was describing frames that were already eight seconds old.

The user had moved. The room had changed. The model was still explaining the past.
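A rough way to see why the backlog never recovers is to model the first version as a single worker draining a FIFO queue. The sketch below is an idealized simulation, not a measurement from the prototype: the function name `simulate_fifo_lag` and the one-call-at-a-time assumption are illustrative, and the 1.75-second service time is simply the midpoint of the 1.5 to 2 second range mentioned above.

```python
def simulate_fifo_lag(capture_interval_s: float,
                      service_time_s: float,
                      duration_s: float) -> float:
    """Lag (finish time minus capture time) of the last frame that finishes
    within duration_s, assuming one serial worker and a FIFO queue."""
    worker_free_at = 0.0
    lag = 0.0
    frame_index = 0
    while True:
        captured_at = frame_index * capture_interval_s
        started_at = max(captured_at, worker_free_at)  # wait for the worker
        finished_at = started_at + service_time_s
        if finished_at > duration_s:
            return lag
        lag = finished_at - captured_at
        worker_free_at = finished_at
        frame_index += 1


if __name__ == "__main__":
    # One frame per second, 1.75 s per model call (assumed midpoint).
    for seconds in (30, 60, 120):
        print(seconds, simulate_fifo_lag(1.0, 1.75, seconds))
```

Under these assumptions each processed frame finishes about 0.75 seconds further behind than the one before it, so the lag grows linearly for as long as the stream runs. The absolute numbers from this strict single-worker model come out larger than the eight seconds observed in the prototype, so read it as an illustration of the growth pattern, not a reproduction of that measurement.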

I changed the pipeline around one rule:

The stream can capture more than the model should process.

That rule changed the whole architecture.

video capture
→ frame buffer
→ keyframe selector
→ sliding context window
→ multimodal reasoning call
→ current description

Figure: Real-time multimodal pipeline showing video capture, rolling frame buffer, keyframe selector, sliding context window, audio VAD, transcription path, multimodal reasoning step, stale-frame drop policy, and frontend update.

The backend stopped treating every frame as a job.

It kept a rolling buffer instead. Frames entered the buffer continuously, but only selected keyframes moved into the heavier inference path. The system could capture at a higher rate for smoothness while reasoning at a lower rate for stability.

The first useful version selected one keyframe roughly every two seconds.

Not perfect. Much better than pretending every frame deserved a model call.

The request sent to the model also changed. A single frame can miss motion. Three recent keyframes give more shape:

T-4s: door mostly open
T-2s: door halfway closed
T: door nearly closed

That lets the model describe change, not just objects.


Sliding-window keyframe selector that keeps a rolling frame buffer, samples one keyframe every N seconds, drops stale frames, and builds a three-frame context packet.
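A minimal sketch of such a selector might look like the following. It is not the prototype's code: the `Frame` and `KeyframeSelector` names, the 6-second freshness limit, and the 120-frame buffer size are all assumptions chosen for illustration. Sampling here is purely time-based (one keyframe every `keyframe_interval_s` seconds); it makes no attempt at content-aware selection such as scene-change detection.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Deque, List, Optional


@dataclass
class Frame:
    captured_at: float  # seconds, e.g. from time.monotonic()
    image: Any          # raw frame payload; opaque to the selector


class KeyframeSelector:
    def __init__(self, keyframe_interval_s: float = 2.0,
                 window_size: int = 3,
                 max_frame_age_s: float = 6.0,   # assumed freshness limit
                 buffer_size: int = 120) -> None:
        self.keyframe_interval_s = keyframe_interval_s
        self.max_frame_age_s = max_frame_age_s
        # Rolling capture buffer: old frames fall off the left automatically.
        self.buffer: Deque[Frame] = deque(maxlen=buffer_size)
        # Sliding context window: only the most recent keyframes are kept.
        self.keyframes: Deque[Frame] = deque(maxlen=window_size)
        self._last_keyframe_at: Optional[float] = None

    def add_frame(self, frame: Frame) -> bool:
        """Buffer every frame; promote one to keyframe per interval.
        Returns True if this frame was promoted."""
        self.buffer.append(frame)
        due = (self._last_keyframe_at is None or
               frame.captured_at - self._last_keyframe_at
               >= self.keyframe_interval_s)
        if due:
            self.keyframes.append(frame)
            self._last_keyframe_at = frame.captured_at
        return due

    def context_window(self, now: float) -> List[Frame]:
        """Keyframes that are still fresh at `now`, oldest first."""
        return [f for f in self.keyframes
                if now - f.captured_at <= self.max_frame_age_s]
```

With one frame captured per second, `add_frame` promotes every second frame, and `context_window` returns at most three keyframes, silently leaving out any that have aged past the limit.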

The important backend behavior was dropping stale work.

A normal queue tries to process everything eventually. That is wrong for live video. If a frame is too old, processing it creates a worse user experience than skipping it.

So the queue needed a freshness rule:

if frame_age_ms > MAX_FRAME_AGE_MS:
    drop(frame)               # stale: skip it and never retry
else:
    send_to_inference(frame)  # still fresh enough to describe

This felt uncomfortable at first because dropping data sounds like failure. For live systems, stale data is already failed data. Keeping it just hides the failure inside latency.
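As a runnable version of the freshness rule, the sketch below pops frames oldest-first and discards any that have aged out before handing one to inference. The function name `next_fresh_frame`, the `(captured_ms, frame_id)` tuple shape, and the 1500 ms limit are assumptions for illustration, not values from the prototype.

```python
from collections import deque
from typing import Deque, Optional, Tuple

MAX_FRAME_AGE_MS = 1500  # assumed limit; tune per application

QueuedFrame = Tuple[int, str]  # (captured_ms, frame_id)


def next_fresh_frame(queue: Deque[QueuedFrame],
                     now_ms: int,
                     max_age_ms: int = MAX_FRAME_AGE_MS
                     ) -> Tuple[Optional[QueuedFrame], int]:
    """Pop frames oldest-first until one is fresh enough.
    Returns (frame or None, number of stale frames dropped)."""
    dropped = 0
    while queue:
        captured_ms, frame_id = queue.popleft()
        if now_ms - captured_ms > max_age_ms:
            dropped += 1  # stale: skip it and never retry
            continue
        return (captured_ms, frame_id), dropped
    return None, dropped


if __name__ == "__main__":
    q = deque([(0, "a"), (1000, "b"), (2000, "c"), (3000, "d")])
    print(next_fresh_frame(q, now_ms=3200))
```

Returning the drop count alongside the frame keeps dropping visible: it can be logged as a metric instead of disappearing inside latency.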

Audio needed a separate path.

The first version sent audio chunks too aggressively. Silence, fan noise, keyboard clicks, background movement, all of it created unnecessary work.

I added voice activity detection before transcription.

microphone input
→ audio chunks
→ VAD
→ speech segment only
→ transcription
→ merged with visual context

Audio pipeline using VAD to ignore silence/background noise, segment speech, send only voiced chunks to transcription, and attach transcript snippets to the multimodal context.
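The segmentation step can be sketched with a very simple energy-threshold VAD. This is a stand-in technique, not necessarily what the prototype used: a trained VAD model will separate speech from steady noise such as fans far better than an RMS threshold. The function names, the 0.02 threshold, and the two-chunk hangover are assumptions, and the transcription call itself is left out.

```python
import math
from typing import List, Optional, Sequence, Tuple


def rms(chunk: Sequence[float]) -> float:
    """Root-mean-square energy of one chunk of samples in [-1.0, 1.0]."""
    if not chunk:
        return 0.0
    return math.sqrt(sum(s * s for s in chunk) / len(chunk))


def speech_segments(chunks: Sequence[Sequence[float]],
                    threshold: float = 0.02,
                    hangover_chunks: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) chunk-index ranges, end exclusive, that look voiced.
    A segment stays open through up to hangover_chunks quiet chunks so that
    short pauses between words do not split it."""
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    quiet_run = 0
    for i, chunk in enumerate(chunks):
        if rms(chunk) >= threshold:
            if start is None:
                start = i          # speech begins
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run > hangover_chunks:
                # Close the segment at the last voiced chunk.
                segments.append((start, i - quiet_run + 1))
                start = None
                quiet_run = 0
    if start is not None:
        # Stream ended mid-segment; trim any trailing quiet chunks.
        segments.append((start, len(chunks) - quiet_run))
    return segments
```

Only the chunks inside the returned ranges would be forwarded to transcription; everything else is discarded before it costs any compute.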

That helped in two ways.

It reduced compute, and it made the final context cleaner. The model didn’t need a transcript full of empty segments or background noise markers. It needed speech that actually mattered.

The combined context packet became small and deliberate:

{
  "session_id": "live_...",
  "frame_window": ["t_minus_4s", "t_minus_2s", "current"],
  "latest_transcript": "Can you tell me what is in front of me?",
  "previous_description": "You are facing a hallway with a closed door on the left.",
  "task": "describe_current_scene_for_accessibility"
}

Multimodal context packet schema with sampled keyframes, latest transcript segment, previous description, timestamp metadata, and freshness limits.
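One way to express that schema in code is a pair of dataclasses. The field names beyond those in the JSON example (`captured_at_ms`, `created_at_ms`, `max_frame_age_ms`), the `KeyframeRef` type, and the rule that freshness is judged by the newest keyframe are assumptions added for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class KeyframeRef:
    frame_id: str        # e.g. "t_minus_4s", "t_minus_2s", "current"
    captured_at_ms: int  # capture timestamp (assumed field)


@dataclass
class ContextPacket:
    session_id: str
    frame_window: List[KeyframeRef]
    latest_transcript: str
    previous_description: str
    task: str
    created_at_ms: int     # when the packet was assembled (assumed field)
    max_frame_age_ms: int  # freshness limit carried with the packet (assumed)

    def is_fresh(self, now_ms: int) -> bool:
        """True if the newest keyframe is within the freshness limit."""
        if not self.frame_window:
            return False
        newest = max(f.captured_at_ms for f in self.frame_window)
        return now_ms - newest <= self.max_frame_age_ms

    def to_json(self) -> str:
        """Serialize the packet, including nested keyframe references."""
        return json.dumps(asdict(self))


if __name__ == "__main__":
    packet = ContextPacket(
        session_id="live_demo",  # hypothetical id for the example
        frame_window=[KeyframeRef("t_minus_4s", 6000),
                      KeyframeRef("t_minus_2s", 8000),
                      KeyframeRef("current", 10000)],
        latest_transcript="Can you tell me what is in front of me?",
        previous_description="You are facing a hallway with a closed door on the left.",
        task="describe_current_scene_for_accessibility",
        created_at_ms=10050,
        max_frame_age_ms=1500,
    )
    print(packet.is_fresh(now_ms=10500), packet.to_json())
```

Carrying the timestamps and the limit inside the packet lets the reasoning layer re-check freshness right before the model call, so a packet that waited too long can be discarded there as well.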

The pipeline became less like request-response and more like traffic control.

Some data is accepted.
Some data is sampled.
Some data is dropped.
Some data is summarized.
Only the useful slice reaches the heavy model call.

That reduced lag from several seconds to near-real-time behavior in the prototype. The model still had latency. The difference was that the system stopped feeding it work faster than it could answer.
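One possible way to keep the model from being fed faster than it can answer is a single-slot, latest-wins mailbox between the selection layer and the reasoning layer. The source does not say the prototype worked this way; the `LatestSlot` class and its method names are hypothetical.

```python
import threading
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


class LatestSlot(Generic[T]):
    """Holds at most one pending item. A newer offer replaces an
    older item that has not been taken yet."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._item: Optional[T] = None
        self.replaced = 0  # pending items overwritten before being taken

    def offer(self, item: T) -> None:
        """Called by the selector whenever a new context packet is ready."""
        with self._lock:
            if self._item is not None:
                self.replaced += 1
            self._item = item

    def take(self) -> Optional[T]:
        """Called by the inference worker after its previous call finishes."""
        with self._lock:
            item, self._item = self._item, None
            return item
```

If the worker only calls `take()` once its previous model call has returned, at most one call is in flight and the backlog can never exceed one item, whatever the capture rate.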

The service boundaries ended up like this:

capture layer
  handles camera and microphone input

buffer layer
  stores recent frames and audio chunks

selection layer
  chooses keyframes and speech segments

reasoning layer
  calls the multimodal model

presentation layer
  updates the user with current descriptions

Each layer had a different job. The model was only one part of the loop.

This is the part I’m taking seriously with real-time multimodal systems. The hard problem is not just “can the model understand a frame?” It is whether the backend can keep the model looking at the right moment.

For accessibility work, old output can be worse than no output. A delayed description can make the user trust stale context. That changes the engineering priorities.

The system needs:

  • frame freshness limits
  • stale-work dropping
  • keyframe sampling
  • sliding context windows
  • VAD before transcription
  • separate audio and video paths
  • bounded inference rate
  • clear latency metrics
  • safe fallback when the model falls behind
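The last two items, latency metrics and a safe fallback, can share one small piece of state. The sketch below is an assumption about how that might be tracked, not the prototype's implementation: the `DescriptionHealth` class, its method names, and the 3000 ms limit are hypothetical.

```python
from collections import deque
from typing import Deque, Optional


class DescriptionHealth:
    def __init__(self, max_description_age_ms: int = 3000,
                 history: int = 50) -> None:
        self.max_description_age_ms = max_description_age_ms  # assumed limit
        self.latencies_ms: Deque[int] = deque(maxlen=history)
        self._last_frame_captured_ms: Optional[int] = None

    def record(self, frame_captured_ms: int, description_ready_ms: int) -> None:
        """Log capture-to-description latency for one completed inference."""
        self.latencies_ms.append(description_ready_ms - frame_captured_ms)
        self._last_frame_captured_ms = frame_captured_ms

    def worst_recent_latency_ms(self) -> Optional[int]:
        """Largest latency in the recent history, or None if nothing recorded."""
        return max(self.latencies_ms) if self.latencies_ms else None

    def should_fall_back(self, now_ms: int) -> bool:
        """True when the newest description reflects a scene older than the limit."""
        if self._last_frame_captured_ms is None:
            return True
        return now_ms - self._last_frame_captured_ms > self.max_description_age_ms
```

When `should_fall_back` returns true, the presentation layer could tell the user that the description is out of date instead of continuing to present it as current.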

A live stream is not a folder of images. It is continuous pressure on the backend.

The architecture has to decide what matters right now, what can be ignored, and what should never reach the expensive model path.

First written on August 12, 2024.
