AI SystemsEngineering Log December 15, 2023 7 min read

Designing a Multimodal Routing Layer Before Vision Access

A JPEG broke the old text-only chat assumption before any vision model was involved. The fix was a multimodal request envelope, file validation, route classification, and provider-agnostic capability routing.

Multimodal AIBackend ArchitectureAPI DesignFile ValidationModel RoutingCost ControlProvider Abstraction

Gemini is everywhere this week.

The bigger shift is pretty obvious now. Users are going to stop sending only text.

They’re going to drop screenshots, charts, receipts, diagrams, PDFs, product photos, support logs, and messy combinations of all of them into the same chat box.

The old assumption was simple:

user message = string

That assumption is starting to break.

The first failure in my own interface came from a JPEG.

A user uploaded a screenshot of a flowchart. The frontend treated it like another chat message and pushed the payload forward. The backend expected text. The parser tried to handle the image path like a string input and failed before any model call happened.

No external vision model was involved yet. The request schema was already wrong.

The old flow only worked for text

The old flow looked like this:

client sends message
→ API validates text
→ prompt gets assembled
→ text model receives prompt
→ response returns

That flow works when every input is text. It becomes fragile once the chat box accepts files.

So I started changing the API boundary before wiring in any live vision endpoint.

The message became an envelope

The new shape treats a chat message as an envelope:

message
  session_id
  user_text
  content_parts[]
    - text
    - image_reference
    - document_reference
    - mime_type
    - size
    - upload_id

Multimodal-ready chat intake showing a client message envelope, API validation, file storage, content-type inspection, route classification, and separate text, image, and document processing paths. — Multimodal-ready chat intake

That changed the backend responsibility.

The API gateway now inspects the request before deciding what kind of work it is. Plain text stays on the normal text path. Images go through file validation and storage first. Unsupported files are rejected before they reach any model layer.

For now, the router only needs to classify the request:

incoming request
→ validate session
→ inspect content parts
→ text only: normal text route
→ image included: store and mark as vision-required
→ unsupported file: reject
→ response: accepted, rejected, or queued

FastAPI request schema for multimodal-ready chat messages using content parts, file references, MIME type, file size, upload ID, and route classification. — Multimodal request schema

Base64 is not the default boundary

The biggest change is avoiding base64 blobs as casual JSON.

Base64 looks convenient in a prototype. It gets ugly fast. Large payloads hit request limits, inflate logs, increase memory pressure, and make retries heavier than they need to be. A chat API should not pass a screenshot around like a normal text field.

The cleaner path is:

image upload
→ object storage
→ file record created
→ chat message references file ID
→ backend validates ownership and MIME type
→ router marks request for vision-capable model

That keeps the message small and gives the backend a real file boundary.

Validation comes before model routing

The validation layer has to exist before model routing:

allowed MIME types
max file size
image decode check
session ownership
upload expiry
file scan status
safe logging
route classification
clear rejection reason

File intake validator that checks MIME type, size, decode validity, session ownership, and returns a normalized file reference for routing. — File intake validator

Provider choice stays behind the router

This also keeps provider choice from leaking into the whole app.

The route table can stay simple:

text only
→ text model

text + image
→ vision-capable model

document upload
→ extraction pipeline
→ retrieval or summarization route

unsupported file
→ reject early

The model names can change. Gemini, GPT-4 with vision access, Claude, local vision models, whatever comes next. The backend should classify the work first, then select the model based on capability, cost, latency, and availability.

Provider-agnostic multimodal router showing content inspection, capability matching, model provider selection, fallback path, and shared response formatter. — Provider-agnostic multimodal router

Cost control starts at classification

This is also where cost control starts.

A plain text question should not accidentally take the expensive vision path. An image upload should not crash the text parser. A malformed file should not reach the model provider. A provider outage should not corrupt the chat session.

Multimodal AI makes the interface feel simpler. The user drops whatever they have into the box.

The backend has to get stricter because of that.

A chat system now needs content envelopes, upload storage, file validation, route classification, capability matching, and fallback behavior. The model endpoint is only one piece. The real work starts at the API boundary where the system decides what the user actually sent.

Onto the next one. Let’s keep sharpening that edge.

First written on December 15, 2023.

Want to implement this architecture in your business?

Discuss Your Project