AI Systems | Engineering Log | May 20, 2024 | 8 min read

Small Language Models at the Edge

Three narrow healthcare classification paths moved off GPT-4 because the task was bounded, repetitive, and sensitive. The better architecture was local small-model inference inside the client VPC with validation, rules, audit metadata, and explicit fallbacks.

Small Language Models, Backend Architecture, Healthcare AI, PII Detection, Local Inference, Validation, Evaluation

I moved three classification paths off GPT-4 today.

That sounds more dramatic than it is. The tasks were narrow, repetitive, and already well-defined. Entity detection. PII review. Short text classification. The kind of work where a frontier model feels impressive during the prototype, then starts looking wasteful once the traffic becomes steady.

The current habit is to throw the strongest model at everything.

It works.
It also hides bad architecture.

For this build, the constraint was healthcare data. The workflow needed to detect and scrub sensitive information before text moved deeper into the application. Names, contact details, patient identifiers, insurance references, medical record fragments, all the stuff you don’t want casually leaving the client’s environment.

The requirements were tight:

raw text stays inside the client VPC
classification latency stays near real time
the UI cannot block for seconds
outputs must be structured
failures must be visible

The old path used a hosted frontier model:

app
→ API
→ outbound model call
→ classification response
→ scrubbed text

It was accurate enough, but the shape was wrong for this workload.

Every request crossed a third-party boundary. Every classification carried network latency. Every traffic spike turned into token spend. For a low-volume prototype, fine. For high-volume sensitive text flowing through a healthcare system, less fine.

Llama 3 8B and Phi-3 changed the conversation this month. Not because they magically replace frontier models, but because they make small, bounded local inference worth taking seriously. Llama 3 8B was released in April, and Phi-3-mini is small enough to make local deployment part of the design discussion instead of a research toy.

So I changed the architecture.

The model moved closer to the data.

client VPC
  |
  |-- application API
  |-- scrubber service
  |-- local SLM inference endpoint
  |-- audit log
  |-- fallback queue

Healthcare PII scrubber architecture inside a client VPC. Application API sends text to a scrubber service, which calls a local small language model endpoint, applies validation rules, writes audit metadata, and returns structured redaction output.

The scrubber service became its own backend boundary.

It didn’t expose a chat interface. It didn’t take open-ended instructions. It accepted a narrow payload and returned a narrow result.

{
  "request_id": "req_...",
  "text": "...",
  "task": "pii_detection",
  "output_schema": "pii_spans_v1"
}

The response had to be structured:

{
  "contains_pii": true,
  "spans": [
    {
      "type": "patient_name",
      "start": 18,
      "end": 31,
      "confidence": 0.91
    }
  ],
  "model_version": "llama3-8b-quantized",
  "policy_version": "pii_rules_v1"
}

PII scrubber request and response schema with request ID, text input, detected spans, confidence score, model version, policy version, and validation errors.
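
A minimal sketch of that schema as Python types, assuming the field names from the JSON above. The validation_errors field, the parse_response helper, and the choice to recompute contains_pii from the spans that survive validation are illustrative assumptions, not a record of the deployed service.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class ScrubRequest:
    request_id: str
    text: str
    task: str           # e.g. "pii_detection"
    output_schema: str  # e.g. "pii_spans_v1"


@dataclass(frozen=True)
class Span:
    type: str
    start: int
    end: int
    confidence: float


@dataclass
class ScrubResponse:
    contains_pii: bool
    spans: list[Span]
    model_version: str
    policy_version: str
    validation_errors: list[str] = field(default_factory=list)


def parse_response(payload: dict, text: str) -> ScrubResponse:
    """Build a ScrubResponse from decoded JSON, collecting errors instead of raising."""
    errors: list[str] = []
    spans: list[Span] = []
    raw_spans = payload.get("spans")
    if not isinstance(raw_spans, list):
        errors.append("spans: expected a list")
        raw_spans = []
    for i, raw in enumerate(raw_spans):
        try:
            span = Span(str(raw["type"]), int(raw["start"]),
                        int(raw["end"]), float(raw["confidence"]))
        except (KeyError, TypeError, ValueError):
            errors.append(f"span {i}: missing or malformed field")
            continue
        # Offsets must point inside the text that was classified.
        if not (0 <= span.start < span.end <= len(text)):
            errors.append(f"span {i}: offsets out of range")
            continue
        if not (0.0 <= span.confidence <= 1.0):
            errors.append(f"span {i}: confidence out of range")
            continue
        spans.append(span)
    return ScrubResponse(
        # Assumption: trust the validated spans, not the model's own flag.
        contains_pii=bool(spans),
        spans=spans,
        model_version=str(payload.get("model_version", "unknown")),
        policy_version=str(payload.get("policy_version", "unknown")),
        validation_errors=errors,
    )
```

Collecting errors rather than raising keeps a partially valid response usable while still making the failure visible to the caller.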

This is where small models make sense.

The task is bounded.
The schema is fixed.
The expected labels are known.
The system can be tested against a local eval set.

I used a quantized Llama 3 8B path for the first deployment test. Phi-3 was also worth evaluating, especially for structured reasoning on compact prompts, but the production move here was not “small model beats GPT-4.” The move was narrower: a small local model can handle a specific classification job well enough when the workflow is constrained and validated.

The inference path looked like this:

incoming text
→ normalize
→ local SLM classification
→ schema validation
→ deterministic rule checks
→ redaction spans returned
→ audit metadata written

Scrubber pipeline that normalizes text, calls a local quantized SLM, validates JSON output, applies deterministic PII rules, and returns redaction spans.
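
Roughly, the pipeline can be sketched like this. The classify and rules callables are hypothetical stand-ins for the local SLM endpoint and the deterministic rule set, and the status strings are invented for illustration. The audit write is left out to keep the sketch short.

```python
import json
import unicodedata
from typing import Callable, Optional


def normalize(text: str) -> str:
    # NFC only. All span offsets refer to this normalized string.
    return unicodedata.normalize("NFC", text)


def validate_model_output(raw: str, text: str) -> Optional[list]:
    """Return well-formed spans, or None when the output is not usable JSON."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(data, dict) or not isinstance(data.get("spans"), list):
        return None
    spans = []
    for s in data["spans"]:
        ok = (
            isinstance(s, dict)
            and isinstance(s.get("type"), str)
            and isinstance(s.get("start"), int)
            and isinstance(s.get("end"), int)
            and 0 <= s["start"] < s["end"] <= len(text)
        )
        if ok:  # malformed spans are dropped in this sketch
            spans.append({"type": s["type"], "start": s["start"],
                          "end": s["end"], "source": "model"})
    return spans


def scrub(text: str,
          classify: Callable[[str], str],
          rules: Callable[[str], list]) -> dict:
    clean = normalize(text)
    model_spans = validate_model_output(classify(clean), clean)
    # Rules always run, so predictable patterns are caught even if the model fails.
    spans = [dict(s, source="rule") for s in rules(clean)]
    status = "ok"
    if model_spans is None:
        status = "model_output_invalid"
    else:
        spans.extend(model_spans)
    spans.sort(key=lambda s: (s["start"], s["end"]))
    return {"status": status, "contains_pii": bool(spans), "spans": spans}
```

Passing the model and the rules in as callables keeps the pipeline testable without a running inference endpoint.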

The deterministic checks stayed in the system.

That part matters. The model could identify likely PII, but the backend still needed rules for obvious patterns: emails, phone numbers, ID-like strings, dates of birth, and known local formats. The SLM handled ambiguous language. The rules handled predictable patterns. The validator checked that the output made sense before anything downstream trusted it.
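
As a rough illustration of the rule side, here is a sketch with three regular expressions and a merge step so overlapping hits are redacted once. The patterns are deliberately simple assumptions (a North American style phone format, ISO dates) and would need to be replaced with the formats that actually appear in the client's data.

```python
import re

# Illustrative patterns only, not a complete PII rule set.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def rule_spans(text):
    """Return (start, end, label) tuples for every pattern match."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def merge_for_redaction(spans):
    """Union overlapping or touching ranges so each character is masked once."""
    merged = []
    for start, end, *_ in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def redact(text, spans, mask="[REDACTED]"):
    """Replace every merged range with the mask, left to right."""
    out, cursor = [], 0
    for start, end in merge_for_redaction(spans):
        out.append(text[cursor:start])
        out.append(mask)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)
```

The same merge step works for model spans, so rule hits and SLM hits can be combined before redaction.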

The deployment also needed normal service controls:

max input length
timeout per inference call
model version in every response
policy version in every response
fallback behavior on invalid JSON
audit metadata without raw text
queue for retryable failures
health check for the local inference endpoint
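
A few of these controls can be sketched in one wrapper. The limits, the in-process queue, and the guarded_classify name are assumptions for illustration. A real deployment would more likely enforce the timeout in the HTTP client and use a durable queue.

```python
import json
import queue
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

MAX_INPUT_CHARS = 8_000    # placeholder limit, tune per deployment
INFERENCE_TIMEOUT_S = 2.0  # placeholder timeout

# Holds request IDs only. Assumption: a retry worker re-fetches the text
# by ID from a store inside the VPC.
fallback_queue: "queue.Queue[dict]" = queue.Queue()
_executor = ThreadPoolExecutor(max_workers=4)


def _fallback(request_id: str, reason: str) -> dict:
    fallback_queue.put({"request_id": request_id, "reason": reason})
    return {"status": "fallback", "reason": reason}


def guarded_classify(request_id, text, classify, timeout_s=INFERENCE_TIMEOUT_S):
    """Call the model with an input limit, a timeout, and explicit fallbacks."""
    if len(text) > MAX_INPUT_CHARS:
        return {"status": "rejected", "reason": "input_too_long"}
    future = _executor.submit(classify, text)
    try:
        raw = future.result(timeout=timeout_s)
    except FutureTimeout:
        # Best effort: cancel() does not interrupt a call that is already running.
        future.cancel()
        return _fallback(request_id, "timeout")
    except Exception:
        return _fallback(request_id, "inference_error")
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return _fallback(request_id, "invalid_json")
    return {"status": "ok", "result": parsed}
```

Every non-ok path returns a reason string, which is what makes the fallback rate measurable later.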

The privacy improvement came from the serving location. Raw healthcare text stayed inside the client VPC during inference. That doesn’t remove all privacy work. Logs still need to avoid raw payloads. Access still needs to be scoped. Retention still needs limits. But the highest-risk movement, sending raw patient-adjacent text to a public model API, was removed from this path.
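
One way to keep raw payloads out of the audit trail is to log a keyed digest and span counts instead of the text. This is a sketch: the record fields and the audit_record helper are assumptions, and the HMAC key is expected to come from the client's own secret store.

```python
import hashlib
import hmac
from datetime import datetime, timezone


def audit_record(request_id, text, spans, model_version, policy_version, hmac_key):
    """Build an audit entry that never contains the raw text or matched values."""
    # A keyed digest lets operators correlate repeated inputs. A plain hash of
    # short text would be easier to brute-force, hence HMAC.
    digest = hmac.new(hmac_key, text.encode("utf-8"), hashlib.sha256).hexdigest()
    counts = {}
    for span in spans:
        counts[span["type"]] = counts.get(span["type"], 0) + 1
    return {
        "request_id": request_id,
        "text_hmac_sha256": digest,
        "text_length": len(text),
        "span_counts": counts,
        "model_version": model_version,
        "policy_version": policy_version,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```

Counts per span type are usually enough to debug drift in model behavior without reopening the privacy question in the logs.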

Latency improved because inference no longer required an outbound round trip to a hosted model API. The exact number depends on hardware, quantization, prompt length, and batching, but the user-facing difference was clear: the scrubber became fast enough to sit inside the workflow instead of feeling like an external review step.

Cost changed too. Hosted frontier models are easy to start with, but high-volume classification turns every small request into token spend. A dedicated small-model endpoint has a steadier cost profile. You pay for the instance and the operational work. That tradeoff is much easier to justify when the task runs constantly and the data cannot leave the environment.
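
The tradeoff can be written down as simple arithmetic. This sketch assumes per-token pricing on the hosted side and a flat hourly instance rate plus a monthly operations overhead on the local side. All inputs are parameters to fill in from real invoices, and the function names are hypothetical.

```python
def hosted_monthly_cost(requests_per_month, tokens_per_request, price_per_1k_tokens):
    """Token-metered cost: grows linearly with traffic."""
    return requests_per_month * tokens_per_request / 1000 * price_per_1k_tokens


def local_monthly_cost(instance_hourly_rate, instances, ops_overhead,
                       hours_per_month=730):
    """Instance-metered cost: flat until another instance is needed."""
    return instance_hourly_rate * hours_per_month * instances + ops_overhead


def break_even_requests(local_cost, tokens_per_request, price_per_1k_tokens):
    """Monthly request volume at which hosted spend equals the local cost."""
    per_request = tokens_per_request / 1000 * price_per_1k_tokens
    if per_request <= 0:
        raise ValueError("per-request hosted cost must be positive")
    return local_cost / per_request
```

Below the break-even volume the hosted path is cheaper on spend alone, which is why the privacy constraint and the steady traffic both matter to the decision.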

The eval set became the release gate.

Before moving each classification path, I checked:

precision on known PII
recall on known PII
false positives on normal medical text
invalid JSON rate
latency p95
timeout rate
fallback rate

Local eval runner comparing hosted frontier output and local SLM output on a healthcare PII classification set, tracking precision, recall, invalid JSON rate, latency, and fallback count.
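
A minimal version of such a runner might look like the sketch below. It assumes exact-match scoring on (start, end, type) triples, counts an invalid JSON response as a miss on every gold span, and uses a nearest-rank p95. EvalCase and the classify callable are hypothetical names.

```python
import json
import math
import time
from dataclasses import dataclass


@dataclass
class EvalCase:
    text: str
    gold: set  # set of (start, end, type) tuples


def p95(values):
    """Nearest-rank 95th percentile."""
    if not values:
        return 0.0
    ordered = sorted(values)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]


def evaluate(cases, classify):
    """Score one classifier against the eval set."""
    tp = fp = fn = invalid = 0
    latencies = []
    for case in cases:
        started = time.perf_counter()
        raw = classify(case.text)
        latencies.append(time.perf_counter() - started)
        try:
            spans = json.loads(raw)["spans"]
            predicted = {(s["start"], s["end"], s["type"]) for s in spans}
        except (json.JSONDecodeError, KeyError, TypeError):
            invalid += 1
            predicted = set()  # every gold span in this case becomes a miss
        tp += len(predicted & case.gold)
        fp += len(predicted - case.gold)
        fn += len(case.gold - predicted)
    total = len(cases)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "invalid_json_rate": invalid / total if total else 0.0,
        "latency_p95_s": p95(latencies),
    }
```

Running evaluate once with the hosted client and once with the local client over the same cases gives two comparable result dictionaries. Exact-match scoring is strict, so a partial-overlap metric may be a better fit for names that the model clips by a character.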

This is the part that keeps the architecture honest. Small models are useful when the task is narrow enough to measure. Without evals, “local AI” becomes another way to ship vibes into production.

For these three classification paths, moving off GPT-4 made the system cleaner:

raw text stayed in the VPC
latency became easier to control
cost became less tied to token volume
model behavior became versioned
validation moved closer to the API
fallbacks became explicit

I’m not taking this as a sign that frontier models are unnecessary. They’re still the right tool for broad reasoning, complex synthesis, and open-ended work.

But a lot of production AI work is not open-ended.

It’s classification, extraction, routing, validation, and cleanup inside existing systems. For those paths, a smaller model running near the data can be the better engineering choice.

First written on May 20, 2024.
