Small Language Models at the Edge
Three narrow healthcare classification paths moved off GPT-4 because the task was bounded, repetitive, and sensitive. The better architecture was local small-model inference inside the client VPC with validation, rules, audit metadata, and explicit fallbacks.
I moved three classification paths off GPT-4 today.
That sounds more dramatic than it is. The tasks were narrow, repetitive, and already well-defined. Entity detection. PII review. Short text classification. The kind of work where a frontier model feels impressive during the prototype, then starts looking wasteful once the traffic becomes steady.
The current habit is to throw the strongest model at everything.
It works.
It also hides bad architecture.
For this build, the constraint was healthcare data. The workflow needed to detect and scrub sensitive information before text moved deeper into the application. Names, contact details, patient identifiers, insurance references, medical record fragments, all the stuff you don’t want casually leaving the client’s environment.
The requirements were tight:
- raw text stays inside the client VPC
- classification latency stays near real time
- the UI cannot block for seconds
- outputs must be structured
- failures must be visible
The old path used a hosted frontier model:
app
→ API
→ outbound model call
→ classification response
→ scrubbed text
It was accurate enough, but the shape was wrong for this workload.
Every request crossed a third-party boundary. Every classification carried network latency. Every traffic spike turned into token spend. For a low-volume prototype, fine. For high-volume sensitive text flowing through a healthcare system, less fine.
Llama 3 8B and Phi-3 changed the conversation over the past month. Not because they magically replace frontier models, but because they make small, bounded local inference worth taking seriously. Llama 3 8B was released in April, and Phi-3-mini is small enough to make local deployment part of the design discussion instead of a research toy.
So I changed the architecture.
The model moved closer to the data.
client VPC
|
|-- application API
|-- scrubber service
|-- local SLM inference endpoint
|-- audit log
|-- fallback queue
Figure: Healthcare PII scrubber architecture inside a client VPC. The application API sends text to a scrubber service, which calls a local small language model endpoint, applies validation rules, writes audit metadata, and returns structured redaction output.
The scrubber service became its own backend boundary.
It didn’t expose a chat interface. It didn’t take open-ended instructions. It accepted a narrow payload and returned a narrow result.
{
  "request_id": "req_...",
  "text": "...",
  "task": "pii_detection",
  "output_schema": "pii_spans_v1"
}
The response had to be structured:
{
  "contains_pii": true,
  "spans": [
    {
      "type": "patient_name",
      "start": 18,
      "end": 31,
      "confidence": 0.91
    }
  ],
  "model_version": "llama3-8b-quantized",
  "policy_version": "pii_rules_v1"
}
Figure: PII scrubber request and response schema with request ID, text input, detected spans, confidence score, model version, policy version, and validation errors.
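A fixed response schema is only useful if something enforces it. Below is a minimal sketch of a validator for the response shape shown above, using only the Python standard library. The `ALLOWED_TYPES` set, the `Span` class, and the `parse_pii_response` function are illustrative names, not part of the original build; the only label the post shows is `patient_name`, so the others are placeholders.

```python
import json
from dataclasses import dataclass

# Hypothetical label set: only "patient_name" appears in the post.
ALLOWED_TYPES = {"patient_name", "contact_detail", "patient_id", "insurance_ref"}


@dataclass(frozen=True)
class Span:
    type: str
    start: int
    end: int
    confidence: float


def parse_pii_response(raw: str, text: str) -> tuple[bool, list[Span]]:
    """Parse model output; raise ValueError if it does not fit the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    contains = data.get("contains_pii")
    if not isinstance(contains, bool):
        raise ValueError("contains_pii must be a boolean")
    raw_spans = data.get("spans")
    if not isinstance(raw_spans, list):
        raise ValueError("spans must be a list")

    spans = []
    for item in raw_spans:
        if not isinstance(item, dict):
            raise ValueError("each span must be an object")
        try:
            span = Span(item["type"], item["start"], item["end"], item["confidence"])
        except KeyError as exc:
            raise ValueError(f"span missing field {exc}") from exc
        if span.type not in ALLOWED_TYPES:
            raise ValueError(f"unknown span type: {span.type!r}")
        # type(...) is int rejects booleans, which are a subclass of int.
        if type(span.start) is not int or type(span.end) is not int:
            raise ValueError("offsets must be integers")
        # Offsets are half-open [start, end) and must fall inside the text.
        if not 0 <= span.start < span.end <= len(text):
            raise ValueError("span offsets fall outside the input text")
        if type(span.confidence) not in (int, float) or not 0 <= span.confidence <= 1:
            raise ValueError("confidence must be a number in [0, 1]")
        spans.append(span)

    # The flag and the span list must agree before anything downstream trusts them.
    if contains != bool(spans):
        raise ValueError("contains_pii disagrees with spans")
    return contains, spans
```

Raising a single exception type keeps the caller simple: anything that is not a clean parse goes down the fallback path instead of reaching the redaction step.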
This is where small models make sense.
The task is bounded.
The schema is fixed.
The expected labels are known.
The system can be tested against a local eval set.
I used a quantized Llama 3 8B path for the first deployment test. Phi-3 was also worth evaluating, especially for structured reasoning on compact prompts, but the production move here was not “small model beats GPT-4.” The move was narrower: a small local model can handle a specific classification job well enough when the workflow is constrained and validated.
The inference path looked like this:
incoming text
→ normalize
→ local SLM classification
→ schema validation
→ deterministic rule checks
→ redaction spans returned
→ audit metadata written
Figure: Scrubber pipeline that normalizes text, calls a local quantized SLM, validates JSON output, applies deterministic PII rules, and returns redaction spans.
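The stage ordering can be sketched as one function that takes each stage as a callable. This is an illustration of the flow, not the production service: `scrub` and its parameter names are hypothetical, and the choice of Unicode NFC for the normalize step is an assumption, since the post does not say what normalization was used.

```python
import unicodedata
from typing import Callable


def normalize(text: str) -> str:
    # Assumption: NFC normalization only. Span offsets index into this string,
    # so redaction must be applied to the normalized text, not the original.
    return unicodedata.normalize("NFC", text)


def scrub(
    text: str,
    classify: Callable[[str], str],          # local SLM call, returns raw output
    validate: Callable[[str, str], list],    # raises on output that breaks the schema
    rule_check: Callable[[str], list],       # deterministic pattern matches
    write_audit: Callable[[dict], None],     # receives metadata only, never raw text
) -> dict:
    normalized = normalize(text)
    raw_output = classify(normalized)
    model_spans = validate(raw_output, normalized)
    spans = sorted(
        model_spans + rule_check(normalized),
        key=lambda s: (s["start"], s["end"]),
    )
    result = {"contains_pii": bool(spans), "spans": spans}
    # Audit metadata is written just before the result is handed back.
    write_audit({"span_count": len(spans), "input_chars": len(normalized)})
    return result
```

Passing the stages in as arguments keeps each one replaceable in tests: the model call can be swapped for a stub without touching validation or rules.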
The deterministic checks stayed in the system.
That part matters. The model could identify likely PII, but the backend still needed rules for obvious patterns: emails, phone numbers, ID-like strings, dates of birth, and known local formats. The SLM handled ambiguous language. The rules handled predictable patterns. The validator checked that the output made sense before anything downstream trusted it.
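A minimal sketch of the deterministic side, assuming two illustrative patterns: a simple email regex and a 3-3-4 digit phone format. Neither pattern comes from the original system, and real deployments would need the client's own local formats. The helper names (`rule_spans`, `merge_intervals`, `redact`) are hypothetical.

```python
import re

# Illustrative patterns only; real rules depend on the formats the client uses.
RULES = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def rule_spans(text: str) -> list[dict]:
    """Return a span for every regex match, with confidence fixed at 1.0."""
    spans = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            spans.append({
                "type": label,
                "start": match.start(),
                "end": match.end(),
                "confidence": 1.0,
            })
    return spans


def merge_intervals(spans: list[dict]) -> list[tuple[int, int]]:
    """Union overlapping or touching [start, end) ranges."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted((s["start"], s["end"]) for s in spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def redact(text: str, spans: list[dict], mask: str = "[REDACTED]") -> str:
    """Replace every merged range with the mask, left to right."""
    pieces, cursor = [], 0
    for start, end in merge_intervals(spans):
        pieces.append(text[cursor:start])
        pieces.append(mask)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

Merging intervals before redacting means a model span and a rule span that cover the same text produce one mask instead of two overlapping replacements.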
The deployment also needed normal service controls:
- max input length
- timeout per inference call
- model version in every response
- policy version in every response
- fallback behavior on invalid JSON
- audit metadata without raw text
- queue for retryable failures
- health check for the local inference endpoint
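Several of these controls fit in one guard around the inference call. The sketch below is hedged throughout: the limits are placeholder values, `guarded_classify` and `_finish` are hypothetical names, and the inference client is passed in as a callable that is assumed to raise `TimeoutError` when it exceeds its deadline. The retry queue holds request IDs only, on the assumption that the application can look the text up again inside the VPC.

```python
import json
import queue
from typing import Callable, Optional

MAX_INPUT_CHARS = 4000        # placeholder limit, not from the original build
INFERENCE_TIMEOUT_S = 0.5     # placeholder timeout, not from the original build
MODEL_VERSION = "llama3-8b-quantized"
POLICY_VERSION = "pii_rules_v1"

retry_queue: "queue.Queue[str]" = queue.Queue()
audit_log: list[dict] = []


def _finish(request_id: str, status: str, input_chars: int,
            spans: Optional[list] = None) -> dict:
    # Audit entries carry metadata only: no raw text, no span contents.
    audit_log.append({
        "request_id": request_id,
        "status": status,
        "input_chars": input_chars,
        "span_count": len(spans) if spans is not None else None,
        "model_version": MODEL_VERSION,
        "policy_version": POLICY_VERSION,
    })
    return {
        "request_id": request_id,
        "status": status,
        "spans": spans,
        "model_version": MODEL_VERSION,
        "policy_version": POLICY_VERSION,
    }


def guarded_classify(request_id: str, text: str,
                     infer: Callable[[str, float], str]) -> dict:
    if len(text) > MAX_INPUT_CHARS:
        return _finish(request_id, "rejected_too_long", len(text))
    try:
        raw = infer(text, INFERENCE_TIMEOUT_S)
    except TimeoutError:
        retry_queue.put(request_id)   # retryable failure
        return _finish(request_id, "timeout_queued", len(text))
    try:
        spans = json.loads(raw)["spans"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Fail closed: the caller must not treat this text as clean.
        return _finish(request_id, "invalid_output_fallback", len(text))
    return _finish(request_id, "ok", len(text), spans)
```

Every exit path goes through `_finish`, so the model version, policy version, and an audit entry are attached to failures as well as successes.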
The privacy improvement came from the serving location. Raw healthcare text stayed inside the client VPC during inference. That doesn’t remove all privacy work. Logs still need to avoid raw payloads. Access still needs to be scoped. Retention still needs limits. But the highest-risk movement, sending raw patient-adjacent text to a public model API, was removed from this path.
Latency improved because the request no longer crossed the public model boundary. The exact number depends on hardware, quantization, prompt length, and batching, but the user-facing difference was clear: the scrubber became fast enough to sit inside the workflow instead of feeling like an external review step.
Cost changed too. Hosted frontier models are easy to start with, but high-volume classification turns every small request into token spend. A dedicated small-model endpoint has a steadier cost profile. You pay for the instance and the operational work. That tradeoff is much easier to justify when the task runs constantly and the data cannot leave the environment.
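The tradeoff can be framed as a break-even calculation. The sketch below is arithmetic only: the function names are hypothetical, and every number in the usage note is a placeholder rather than a price or volume from this project. It also leaves out the operational work, which has to be added to the instance cost by hand.

```python
def hosted_monthly_cost(requests_per_month: float,
                        tokens_per_request: float,
                        price_per_1k_tokens: float) -> float:
    """Token-metered cost: grows linearly with request volume."""
    return requests_per_month * tokens_per_request / 1000 * price_per_1k_tokens


def break_even_requests(instance_monthly_cost: float,
                        tokens_per_request: float,
                        price_per_1k_tokens: float) -> float:
    """Monthly request volume at which a flat instance cost equals token spend."""
    cost_per_request = tokens_per_request / 1000 * price_per_1k_tokens
    return instance_monthly_cost / cost_per_request
```

With placeholder inputs of 500 tokens per request, 0.01 per thousand tokens, and 1000 per month for the instance, `break_even_requests(1000, 500, 0.01)` comes out at about 200,000 requests per month. Below that volume the hosted path is cheaper on these made-up numbers; above it, the dedicated endpoint is.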
The eval set became the release gate.
Before moving each classification path, I checked:
- precision on known PII
- recall on known PII
- false positives on normal medical text
- invalid JSON rate
- latency p95
- timeout rate
- fallback rate
Figure: Local eval runner comparing hosted frontier output and local SLM output on a healthcare PII classification set, tracking precision, recall, invalid JSON rate, latency, and fallback count.
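A minimal sketch of how the core metrics in that checklist could be computed, assuming exact-match scoring on `(type, start, end)` tuples and a nearest-rank p95. The example record layout (`gold`, `predicted`, `latency_ms`) and the function names are hypothetical; the original eval runner is not shown in the post. A `predicted` value of `None` stands for output that failed JSON parsing.

```python
def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n)."""
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100   # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def evaluate(examples: list[dict]) -> dict:
    """Aggregate span-level precision/recall, invalid JSON rate, and p95 latency."""
    tp = fp = fn = invalid = 0
    latencies = []
    for ex in examples:
        latencies.append(ex["latency_ms"])
        gold = set(ex["gold"])
        if ex["predicted"] is None:
            # Unparseable output: count it, and score every gold span as missed.
            invalid += 1
            fn += len(gold)
            continue
        predicted = set(ex["predicted"])
        tp += len(gold & predicted)
        fp += len(predicted - gold)
        fn += len(gold - predicted)
    return {
        "precision": tp / (tp + fp) if tp + fp else 1.0,
        "recall": tp / (tp + fn) if tp + fn else 1.0,
        "invalid_json_rate": invalid / len(examples),
        "latency_p95_ms": p95(latencies),
    }
```

Exact-match scoring is strict: a span that is off by one character counts as both a false positive and a false negative. That is a deliberate simplification here; an overlap-based score would be more forgiving but harder to reason about as a release gate.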
This is the part that keeps the architecture honest. Small models are useful when the task is narrow enough to measure. Without evals, “local AI” becomes another way to ship vibes into production.
For these three classification paths, moving off GPT-4 made the system cleaner:
- raw text stayed in the VPC
- latency became easier to control
- cost became less tied to token volume
- model behavior became versioned
- validation moved closer to the API
- fallbacks became explicit
I’m not taking this as a sign that frontier models are unnecessary. They’re still the right tool for broad reasoning, complex synthesis, and open-ended work.
But a lot of production AI work is not open-ended.
It’s classification, extraction, routing, validation, and cleanup inside existing systems. For those paths, a smaller model running near the data can be the better engineering choice.
First written on May 20, 2024.